Driven by medical radiomics and high-throughput techniques, there has been explosive growth in the amount of advanced multiomics data in modern biomedical research. These data contain abundant information reflecting cellular context, allowing researchers to disentangle biomolecular mechanisms and gain a more comprehensive understanding of the biological processes underlying complex diseases.[ 1,2 ] Single-omics data provide only limited information on biological systems. The systematic study of biomedical objects by integrating and analyzing data at diverse scales is an emerging field.[ 3–5 ] Integrating multiomics data makes it possible to understand complex biological systems from different perspectives.[ 5,6 ] Multiomics data can help identify promising therapeutic targets and address key biomedical objectives, including personalized therapy for complex diseases,[ 7–11 ] drug discovery,[ 12–15 ] and drug target discovery.[ 16–20 ]
However, multiomics data are complex, high dimensional, and heterogeneous.[ 21,22 ] By far the most important challenge is how to extract valuable knowledge from these data. Because the various omics data are obtained with different measurement techniques, their distributions differ, and this inconsistency of data distributions is one of the difficulties to be overcome. At the same time, each sample contains data from multiple omics, and exploring the potential correlations between these data is another challenge. Biomedical data are also characterized by high dimensionality and few samples, which leads to the curse of dimensionality during data mining and reduces the generalization ability of models. To address these challenges, researchers have developed methods such as multiple kernel learning, Bayesian approaches, dimensionality reduction approaches, network-based methods, and deep learning (DL) methods.[ 23–30 ] With the continuous development of DL, many DL methods have emerged to integrate multiomics data. DL provides an efficient framework for processing large amounts of high-dimensional, complex multiomics data, and DL-based methods can capture the typical nonlinearities and complex relationships in biological data to achieve more accurate predictions. DL, as exemplified by multiomics data modeling, has achieved substantial success in biomedical fields, and the combination of DL and multiomics data significantly benefits the development of modern biomedical studies.
In this article, we mainly focus on how to effectively extract omics data representations and how to integrate these representations. We classify multiomics integration models into six categories according to the underlying DL framework. In addition, we review and discuss the application of these methods in the biomedical field, including clinical prediction, biomarker identification, drug sensitivity prediction, synthetic lethality (SL) prediction, and single cell-related studies (Figure 1 ).
Figure 1. Multiomics data integration method using state-of-the-art DL models and their biomedical field applications.
We conducted a comprehensive survey of state-of-the-art DL-based multiomics data integration methods in the biomedical field. We classified these methods into six categories based on the fully connected neural network (FCNN), convolutional neural network (CNN), autoencoder (AE), graph neural network (GNN), capsule network (CapsNet), and generative adversarial network (GAN) (Table 1 ). The methods of each category are detailed in the following sections (Table 2 ).
Table 1. Categories of DL-based multiomics integration methods
| Method | Description | Ref. |
| --- | --- | --- |
| FCNN | NN with only fully connected layers | [32–37,84–89] |
| CNN | NN with convolutional and pooling layers | [41,42,44,90–94] |
| AE | | |
| Standard AE | A neural network comprising an input layer, a hidden layer, and an output layer, where the output layer has the same dimension as the input layer | [47,48,95–97] |
| DAE | An autoencoder with noise added to the input layer | [49–51] |
| SAE | An autoencoder with multiple hidden layers | [53,98,99] |
| CrossAE | An autoencoder whose hidden layer is used to reconstruct the inputs of the other omics' autoencoders | [54] |
| VAE | An autoencoder whose hidden-layer features obey a given probability distribution | [58–60,62,63,100–102] |
| GNN | A class of NNs for processing data represented as graph structures | [68–71] |
| CapsNet | A convolutional NN augmented with capsule structures | [74,75] |
| GAN | A NN for generating samples, comprising a generator and a discriminator | [77] |
Table 2. Biomedical applications of the six categories of models
| Study a) | Model | Feature types | Data source | Application fields |
| --- | --- | --- | --- | --- |
| Huang et al.[ 32 ] | FCNN | Gene expression, miRNA expression, copy number burden, tumor mutation burden, demographic, and clinical information | TCGA and cBioPortal | Prognosis |
| Sharifi-Noghabi et al.[ 33 ] | FCNN | Somatic point mutation, copy number aberration, and gene expression data | GDSC, PDX Encyclopedia, and TCGA | Drug response prediction |
| Lin et al.[ 34 ] | FCNN | Gene expression, DNA methylation, copy number variation, and clinical information | TCGA | Cancer subtypes classification |
| Preuer et al.[ 35 ] | FCNN | The chemical descriptors for drug A and drug B, and the genomic information of the cell line | O’Neil dataset[ 103 ] and ArrayExpress | Synergistic drug combination prediction |
| Kuru[ 36 ] | FCNN | Gene expression data and the drug pairs’ chemical structure | DrugComb and ArrayExpress | Synergistic drug combination prediction |
| Bica et al.[ 37 ] | FCNN | Gene expression data and DNA methylation data | TCGA | Cancer subtypes classification |
| Fu et al.[ 41 ] | CNN | Variation counts, expression level data, QTANs/QTALs number, and WGCNA module features | SRA, ENA, GEO, PubMed, ARKdb genome and NCBI Nucleotide database | Gene regulation mechanisms |
| Islam et al.[ 42 ] | CNN | Copy number variation, gene expression data and clinical data | METABRIC project | Cancer subtypes classification |
| Wu et al.[ 44 ] | CNN | Multiview high-resolution CT images | Collected from three hospitals in China | Medical imaging-assisted diagnosis |
| Zhang et al.[ 47 ] | AE | Gene expression data, copy number variation, genetic mutation data, and drug physicochemical properties | TCGA and CCLE | Synergistic drug combination prediction |
| Yang et al.[ 48 ] | AE | Gene expression, DNA methylation and miRNA expression data | TCGA | Cancer subtypes classification |
| Seal et al.[ 49 ] | DAE | DNA methylation, copy number variation and RNA-seq data | TCGA | Biomarker identification |
| Poirion et al.[ 50 ] | DAE | Gene expression, miRNA expression and methylation data | TCGA and GEO | Cancer subtypes classification |
| Guo et al.[ 51 ] | DAE | Gene expression, miRNA expression and copy number variation | TCGA and GEO | Cancer subtypes classification |
| Yuan et al.[ 53 ] | SAE | Multi-channel scalp electroencephalogram signals | CHB-MIT dataset | Medical imaging-assisted diagnosis |
| Tong et al.[ 54 ] | Cross-AE | Gene expression, DNA methylation, copy number variation and miRNA expression data | MNIST and TCGA | Prognosis |
| Zuo et al.[ 58 ] | VAE | scRNA-seq data, scATAC-seq data | The cell line mixture dataset, AdBrainCortex and three simulated datasets | Single cell-related study |
| Ronen et al.[ 59 ] | VAE | Gene expression, point mutations, and copy number variation | TCGA | Cancer subtypes classification |
| Zhang et al.[ 60 ] | OmiVAE | Gene expression and DNA methylation data | TCGA | Cancer subtypes classification |
| Hira et al.[ 62 ] | MMD-VAE | DNA methylation, gene expression and copy number variation | TCGA | Prognosis |
| Jiang et al.[ 68 ] | GNN | Drug-drug synergy data, drug-target interaction data and PPI data | O’Neil dataset, STITCH, STRING and BioGRID | Synergistic drug combination prediction |
| Hao et al.[ 69 ] | GNN | SL-gene pairs data, mutation, copy number variation and gene expression | BioGRID, SynLethDB and TCGA | SL prediction |
| Tang et al.[ 70 ] | GNN | miRNA and disease information | miRBase, HMDD, MeSH, DisGeNET, HumanNet and miR2disease database | Biomarker identification |
| Wang et al.[ 71 ] | MOGONET | Gene expression data, DNA methylation data, miRNA expression data | ROSMAP and TCGA | Prognosis and biomarker identification |
| Afshar et al.[ 74 ] | CapsNet | 3D local and global features | LIDC-IDRI | Medical imaging-assisted diagnosis |
| Peng et al.[ 75 ] | CapsNetMMD | Gene expression, DNA methylation and DNA copy-number alterations | TCGA | Gene discovery |
| Ahmed et al.[ 77 ] | OmicsGAN | Gene expression, miRNA expression data and clinical information | TCGA and TargetScanHuman | Cancer subtypes classification and prognosis |
a)TCGA, The Cancer Genome Atlas; CCLE, Cancer Cell Line Encyclopedia; PDX, patient-derived xenograft; QTANs, quantitative trait-associated nucleotides; QTALs, quantitative trait-associated loci; SRA, Sequence Read Archive; ENA, European Nucleotide Archive; GEO, Gene Expression Omnibus; METABRIC, Molecular Taxonomy of Breast Cancer International Consortium; CHB, Children's Hospital Boston; MIT, Massachusetts Institute of Technology; MNIST, Modified National Institute of Standards and Technology; PPI, protein–protein interaction; STITCH, Search Tool for Interacting Chemicals; SL, synthetic lethality; BioGRID, Biological General Repository for Interaction Datasets; HMDD, Human MicroRNA Disease Database; MeSH, Medical Subject Headings; ROSMAP, Religious Orders Study/Memory and Aging Project; LIDC-IDRI, Lung Image Database Consortium image collection.
Fully Connected Neural Network for Multiomics Data Integration

The FCNN,[ 31 ] one of the earliest proposed frameworks, is the simplest way to extract features automatically from multiomics data, and it often outperforms traditional machine learning methods. Therefore, many models use the FCNN framework to integrate multiomics data. Most of these methods are used for drug response prediction, synergistic drug combination prediction, cancer subtypes classification, and prognosis (Figure 2 ).
Figure 2. Graphic summary of the FCNN-based multiomics data integration method. Multiomics data are concatenated as the input of the FCNN. After the multilayered NN is trained, the output is applied in many biomedical fields.
Huang et al.,[ 32 ] Sharifi-Noghabi et al.,[ 33 ] and Lin et al.[ 34 ] used similar FCNN-based models. These models learn features from each single-omics dataset, concatenate the features, and feed the integrated representation into a subnetwork for the downstream task. Sharifi-Noghabi et al. proposed a deep neural network-based multiomics integration method, MOLI, for drug response prediction. MOLI takes somatic mutation, copy number variation, and gene expression data as input and predicts the response to a given drug as output. MOLI exhibits competitive performance compared with other methods across multiple datasets.
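As an illustration of this shared pattern, the NumPy sketch below encodes each omics with its own fully connected layer and concatenates the learnt features before a prediction subnetwork, as MOLI and related models do. All dimensions, layer sizes, and weights are hypothetical placeholders, not the authors' actual architectures.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def encode_omics(x, w, b):
    # One fully connected layer per omics type (a real model stacks several).
    return relu(x @ w + b)

# Hypothetical feature counts for three omics and a shared hidden size.
dims_in = {"mutation": 100, "cnv": 200, "expression": 500}
hidden = 32

params = {k: (rng.normal(scale=0.01, size=(d, hidden)), np.zeros(hidden))
          for k, d in dims_in.items()}

# One sample per omics (a real model would use mini-batches).
sample = {k: rng.normal(size=(1, d)) for k, d in dims_in.items()}

# Encode each omics separately, then concatenate the learnt features.
encoded = [encode_omics(sample[k], *params[k]) for k in dims_in]
joint = np.concatenate(encoded, axis=1)   # shape (1, 3 * 32) = (1, 96)

# The joint representation feeds a prediction subnetwork (here one layer).
w_out = rng.normal(scale=0.01, size=(joint.shape[1], 1))
score = joint @ w_out                     # e.g., a drug-response score
```

This per-omics-encoder design is "intermediate fusion": each omics gets its own weights before the representations are merged.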
Using a similar framework, Preuer et al.[ 35 ] developed a DL model, DeepSynergy, to predict drug synergy. This FCNN-based model takes the gene expression profiles of cell lines and the chemical descriptors of two drugs as input and feeds them into a single neural network to learn features from the multiomics data. Finally, the model predicts the synergistic drug combinations.
Unlike DeepSynergy, MatchMaker[ 36 ] trains two parallel subnetworks, each of which combines the chemical structure features of one drug with the gene expression features of the corresponding cell line (Figure 3 ). The drug-specific representations on a particular cell line are then concatenated into a joint representation that becomes the input to a third subnetwork, which predicts the synergistic effect of the drug pair.
Figure 3. The difference between the DeepSynergy and MatchMaker frameworks. The upper part represents the DeepSynergy model, and the lower part represents the MatchMaker model. Both are based on FCNN. DeepSynergy is an early fusion method, whereas MatchMaker is an intermediate fusion method.
Besides integrating multiomics features, exploring the potential correlations between these features is also a key question for multiomics data fusion. Some approaches enable information to flow between different omics data by changing the structure of the neural network. Bica et al.[ 37 ] proposed a crossomics superlayered neural network (SNN) architecture for the integration of multiomics data. The network allows biological interpretation of which genes are most relevant to the decision process and how the multiomics interact with each other. The model uses subnetworks to learn the features of each omics and adds crosscorrelations between the subnetworks. By exploiting the correlations between omics, each single-omics dataset contributes to the downstream task. The SNN is capable of extracting the crosscorrelations present in multiomics datasets and thus achieves good performance, especially on datasets with a limited number of training samples. These results demonstrate that incorporating crossconnections in neural networks allows better integration of multiomics data.
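A minimal sketch of such crossconnections, assuming two omics and a single hidden layer (all sizes and weights are illustrative, not the SNN's actual configuration): each subnetwork's hidden layer receives input from both its own omics and the other one.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda x: np.maximum(x, 0.0)

# Two omics inputs with hypothetical feature counts.
x_expr = rng.normal(size=(1, 50))   # gene expression
x_meth = rng.normal(size=(1, 80))   # DNA methylation

h = 16
# Within-omics weights ...
w_expr = rng.normal(scale=0.1, size=(50, h))
w_meth = rng.normal(scale=0.1, size=(80, h))
# ... plus cross-omics weights that let information flow between subnetworks.
w_meth_to_expr = rng.normal(scale=0.1, size=(80, h))
w_expr_to_meth = rng.normal(scale=0.1, size=(50, h))

# Each hidden layer combines its own omics with a crossconnection to the other.
h_expr = relu(x_expr @ w_expr + x_meth @ w_meth_to_expr)
h_meth = relu(x_meth @ w_meth + x_expr @ w_expr_to_meth)
joint = np.concatenate([h_expr, h_meth], axis=1)
```

Inspecting the learnt cross-omics weights is what makes this kind of architecture interpretable: large entries indicate which features of one omics influence the other's representation.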
FCNN methods have already been successfully applied to all the above-discussed applications. For most biomedical tasks, the performance of FCNN is at least on par with classical machine learning approaches. These methods can capture the nonlinear and complex relationships of multiomics data and are suitable for handling high-dimensional and noisy data. However, FCNN methods usually require a large volume of data and thus are computationally expensive. Generally, FCNN methods are suitable for all kinds of application scenarios, especially for feature descriptor-based multiomics data integration.
Convolutional Neural Network for Multiomics Data Integration

CNN[ 38–40 ] greatly reduces the number of parameters by reusing the same set of parameters (the convolutional kernel) in multiple places, which makes it better suited to capturing local features. Compared with FCNN, CNN imposes an a priori restriction on the parameter matrix. CNN methods are mostly used in medical imaging-assisted diagnosis, gene regulation mechanisms, and cancer subtypes classification (Figure 4 ).
Figure 4. Graphic summary of the CNN-based multiomics data integration method. Each omics dataset serves as the input of a separate CNN. After the multilayered neural networks are trained, the model extracts omics data representations. These representations are concatenated and passed through a predictor for application in many biomedical fields.
Fu et al.[ 41 ] proposed a CNN-based model that integrates multiomics information to prioritize candidate genes. The model consists of four convolutional layers, four pooling layers, and three fully connected layers. It uses the features of variation counts, expression levels, QTANs/QTALs numbers, and the module features generated by the weighted gene coexpression network analysis (WGCNA) method as input to predict candidate genes. The results show that the CNN framework outperforms other methods based on a linear fusion strategy.
A deep convolutional neural network (DCNN)-based integration model of multiomics data, DCNN-Concat, was proposed by Islam et al. to classify cancer subtypes.[ 42 ] In DCNN-Concat, two feature vectors are used as the inputs of two CNNs: one from copy number variation data and the other from gene expression data. The architecture of the model is as follows: first, each omics dataset is passed through a convolutional layer, a pooling layer, and a fully connected layer to learn a low-dimensional feature. Next, the outputs of all fully connected layers are concatenated together. Finally, the concatenated features are fed to fully connected layers for classification. Two other models, DCNN-Siamese and DNN-SE, were proposed at the same time. DCNN-Siamese shares the weights between the convolutional and fully connected layers of the two branches. DNN-SE trains a stacked autoencoder (SAE) to initialize a fully connected neural network. Comparing the accuracy and area under the curve (AUC) metrics of these three methods, DCNN-Concat shows the best performance. Although DCNN-Siamese and DNN-SE do not perform best, they have fewer parameters because of weight sharing.
In order to train deeper CNN models with higher accuracy, He et al.[ 43 ] proposed the ResNet model, which focused on establishing a shortcut connection between the front and back layers. This helps the backward propagation of gradients during the training process, thus enabling the training of deeper CNN networks. Wu et al.[ 44 ] developed a multiview integration model based on the modified ResNet-50 architecture to discriminate between the patients with COVID-19 pneumonia and others. This model uses the corresponding computed tomography (CT) images in axial, coronal, and sagittal views as input. These three-view images are used to train the DL network. The output features of Res blocks are concatenated and fed into a fully connected dense layer. Finally, the final layer outputs the risk value of COVID-19 pneumonia.
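The shortcut connection can be sketched in a few lines. This toy residual block uses hypothetical weight matrices and shows that when the residual branch outputs zero, the block reduces to the identity (up to the ReLU), which is why very deep stacks of such blocks remain trainable.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda x: np.maximum(x, 0.0)

def res_block(x, w1, w2):
    f = relu(x @ w1) @ w2   # two-layer transformation F(x)
    return relu(f + x)      # identity shortcut added before the nonlinearity

d = 64
x = rng.normal(size=(1, d))
w1 = rng.normal(scale=0.1, size=(d, d))
w2 = np.zeros((d, d))       # a zeroed residual branch: F(x) == 0

# With F(x) == 0 the block reduces to relu(x), i.e., (almost) the identity,
# so the shortcut gives gradients a direct path to earlier layers.
y = res_block(x, w1, w2)
```

ResNet-50, used by Wu et al., stacks dozens of such blocks (with convolutions in place of the dense matrices here).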
CNN uses convolutional layers to reduce the parameter size and the network complexity, thus avoiding over-fitting to some extent. However, CNN-based methods that apply a 1D convolutional layer to the input vector may not be suitable for traditional tabular multiomics data fusion, according to a previous study.[ 45 ] Therefore, CNN-based methods are mainly used for medical imaging research.
Autoencoders for Multiomics Data Integration

As non-end-to-end models, autoencoders do not require large amounts of labeled data. The autoencoder[ 46 ] is a type of neural network that takes the input data as its learning target and obtains feature representations from the input data. The network consists of two parts: an encoder and a decoder. The encoder maps high-dimensional data to a low-dimensional representation, and the decoder reproduces the input data. Since its intermediate layer represents the low-dimensional features of the input data, the AE is used for dimensionality reduction and feature learning.
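The encode/decode structure can be sketched as follows, with hypothetical feature sizes standing in for real omics dimensions; training would adjust the weights to minimize the reconstruction loss.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical: 500 input features compressed into 16 latent features.
d_in, d_latent = 500, 16
w_enc = rng.normal(scale=0.05, size=(d_in, d_latent))
w_dec = rng.normal(scale=0.05, size=(d_latent, d_in))

def encode(x):
    return np.tanh(x @ w_enc)   # the low-dimensional hidden representation

def decode(z):
    return z @ w_dec            # the reconstruction of the input

x = rng.normal(size=(8, d_in))  # 8 samples
z = encode(x)
x_hat = decode(z)

# Training minimizes the reconstruction error between input and output.
loss = np.mean((x - x_hat) ** 2)
```

After training, the decoder is discarded and `z` serves as the learnt low-dimensional feature for downstream integration.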
In this section, we will introduce five types of autoencoders, which are standard AE, denoising autoencoder (DAE), SAE, crossomics autoencoder (CrossAE), and variational autoencoder (VAE). These autoencoder methods are mostly used in synergistic drug combination prediction, SL prediction, single cell-related studies, biomarker identification and cancer subtypes classification, prognosis, and medical imaging-assisted diagnosis (Figure 5 ).
Figure 5. Graphic summary of the AE-based multiomics data integration method. Each omics dataset serves as the input of a separate AE. After the multilayered neural networks are trained, the model extracts omics data representations. These representations are concatenated and passed through a predictor for application in many biomedical fields.
Zhang et al.[ 47 ] proposed a DL model, the deep neural network synergy model with autoencoders (AuDNNsynergy), to predict drug combinations by integrating multiomics data and chemical structure data, and demonstrated that the proposed model outperformed existing models of drug combination prediction. Zhang compared AuDNNsynergy with DeepSynergy,[ 35 ] an end-to-end FCNN model, and found that AuDNNsynergy outperformed DeepSynergy on various metrics. This indicates that learning the omics features separately with AEs integrates multiomics data better than an end-to-end FCNN model for the drug combination prediction task.
To obtain the subspace characteristics or self-expressive features of the data, Yang et al.[ 48 ] introduced a novel prediction technique as an extension of similarity network fusion (SNF), named deep subspace fusion clustering (DSFC). DSFC utilizes AE and data self-expressiveness approaches to obtain similarities between patients. DSFC consists of three main parts, that is, calculating data self-expressiveness via deep subspace clustering networks (DSC-Nets), fusing patient-similarity networks constructed by reconstruction coefficients via SNF, and predicting cancer subtypes via spectral clustering. Yang used six cancer datasets containing gene expression, miRNA expression, and DNA methylation data to compare DSFC with the classical method SNF and some other relevant methods. The results show that DSFC achieves performance comparable to other state-of-the-art methods in analyzing patient survival.
Denoising Autoencoder

DAE is a modified AE that can obtain robust representations from noisy input and recover the corresponding clean input. Seal et al.[ 49 ] and Poirion et al.[ 50 ] trained DAEs by adding noise to the original samples and reconstructing the original samples from the noisy input data. This allows a more robust low-dimensional feature representation to be learnt, which removes the noise of the original data to some extent.
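The corruption step can be sketched as below. Additive Gaussian noise is one common choice (masking noise is another), and the key point is that the reconstruction loss is measured against the clean input, not the corrupted one.

```python
import numpy as np

rng = np.random.default_rng(4)

def corrupt(x, noise_scale=0.1):
    # Additive Gaussian corruption applied at the input layer.
    return x + rng.normal(scale=noise_scale, size=x.shape)

x_clean = rng.normal(size=(4, 100))   # e.g., methylation features
x_noisy = corrupt(x_clean)

# A DAE encodes/decodes x_noisy but measures the loss against x_clean:
#   loss = mean((decode(encode(x_noisy)) - x_clean) ** 2)
# so the learnt representation must discard the injected noise.
```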
Poirion et al.[ 50 ] proposed an unsupervised multiomics integration pipeline that uses a DAE algorithm to predict survival subtypes in bladder cancer (BC). Poirion used the TCGA dataset containing mRNA, miRNA, and methylation data to infer two survival subtypes. First, the three processed types of omics data are passed through DAEs separately to obtain low-dimensional features from the middle layer of each DAE. Then, the low-dimensional features are combined to obtain the multiomics integrated features. Finally, the survival subtypes are obtained by Gaussian mixture model clustering. This is an important study that uses DAEs to integrate multiomics data to discover survival risk stratification in BC samples.
Guo et al.[ 51 ] proposed a novel DL-based framework that uses DAEs to robustly identify ovarian cancer subtypes. The model first feeds multiomics ovarian cancer features (mRNA, miRNA, and copy number variation) into a DAE to generate low-dimensional features. The integrated features are then clustered across patients using k-means. A lightweight logistic regression classification model with gene expression data is further constructed based on the k-means subtypes, and the robustness of the model is verified on three datasets. To evaluate the clustering performance of the methods, they ran comparison experiments between DAE k-means, AE k-means, kernel principal component analysis (KPCA) k-means, and PCA k-means. The results show that DAE k-means performs best.
SAE

Multiomics integration can also use the SAE.[ 52 ] The SAE is a multilayer autoencoder in which each layer builds on the features expressed by the previous layer to learn more abstract features. In multichannel data mining tasks, simply concatenating raw input features may not be sufficient for DL models to produce robust and accurate results. Yuan et al.[ 53 ] used two feature encoders, a global encoder and a channel encoder, to learn potential representations from global- and channel-specific views, respectively. They trained a unified model, an innovative channel-aware attention framework (ChannelAtt). The model consists of a multiview representation layer with two encoders and a channel-aware attention layer. Both the global and channel encoders are parameterized by two-layer SAEs. ChannelAtt achieves dynamic soft channel selection in multichannel electroencephalogram (EEG) signals and uses the learnt low-dimensional features in seizure detection.
Crossomics Autoencoder

The consensus principle assumes that the disagreement between omics is an upper bound on the model error, so utilizing the consensus information among multiomics data can remove the discrepancies among omics. Tong et al.[ 54 ] developed CrossAE based on this principle. CrossAE uses gene expression, DNA methylation, miRNA expression, and copy number variation from the TCGA breast cancer datasets as input for survival analysis. The model maximizes the agreement among omics to achieve an omics-invariant feature. To achieve a consistent representation between omics, the algorithm uses the hidden features of one omics to reconstruct the input features of the other omics.
Variational Autoencoder

Deep generative models are a powerful framework for modeling high-dimensional multiomics data.[ 55,56 ] In particular, the VAE is an important class of generative model derived from the AE. On top of the standard AE, the VAE imposes a distributional assumption on the low-dimensional latent vectors.
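The two ingredients that distinguish a VAE from a standard AE, the reparameterization trick and the KL regularizer, can be sketched as follows for a diagonal Gaussian encoder (all shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

def reparameterize(mu, log_var):
    # z = mu + sigma * eps keeps the sampling step differentiable.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    # KL(q(z|x) || N(0, I)) per sample, for a diagonal Gaussian encoder.
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=1)

# Hypothetical encoder outputs: 3 samples, 8 latent dimensions.
mu = rng.normal(size=(3, 8))
log_var = rng.normal(scale=0.1, size=(3, 8))
z = reparameterize(mu, log_var)
kl = kl_divergence(mu, log_var)   # added to the reconstruction loss
```

The KL term pulls the latent codes toward the standard normal prior, which is what gives the VAE its generative interpretation.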
Recently, single-cell variational inference (scVI), which uses a standard VAE, has been proposed to analyze scRNA-seq data.[ 56 ] However, the standard VAE places a single isotropic multivariate Gaussian distribution over the latent variables and is usually not suitable for sparse data.[ 57 ] Zuo et al.[ 58 ] proposed a single-cell multiomics variational autoencoder (scMVAE) for the integrated analysis of scRNA-seq and scATAC-seq data measured from the same single cells, using three joint-learning strategies. The scMVAE model uses stochastic optimization and a multiomics encoder. First, the scRNA-seq and scATAC-seq data across similar cells and features are integrated to approximate the joint latent features, with a Gaussian mixture model (GMM) as the prior. Then, a decoder reconstructs the observed expression values, taking into account the normalization of each data type. To demonstrate the effectiveness of the method, they applied scMVAE and other integration methods to both simulated and real datasets. The results show that scMVAE outperforms the existing state-of-the-art methods.
Ronen et al.[ 59 ] developed a multilayer VAE, multiomics autoencoder integration (MAUI), to measure the similarity between colorectal cancer (CRC) tumors and disease models (e.g., cancer cell lines). MAUI effectively uses copy number variation, gene expression, and point mutation data to learn latent factors in lower dimensions. A deeper encoder structure allows highly nonlinear features to be captured better. MAUI is compared with MOFA, iCluster+, and other published methods for multiomics integration by dimensionality reduction. The results show that MAUI has the best performance for multiomics data integration.
Zhang et al.[ 60 ] proposed a DL model called OmiVAE to extract low-dimensional features and classify samples from multiomics data. OmiVAE uses DNA methylation and gene expression profiles as input. The methylation profile has a very large number of CpG sites, almost ten times the feature size of a gene expression profile. To reduce the number of parameters, these CpG sites are divided into different fully connected (FC) blocks according to their target chromosomes; thus, for omics data of large dimensionality, different FC blocks can be used to train the VAE. The performance of OmiVAE is compared with other feature extraction and dimensionality reduction methods such as PCA, kernel PCA, t-SNE, and UMAP,[ 61 ] and OmiVAE is demonstrated to outperform them. Hira et al.[ 62 ] designed a DL algorithm named the maximum mean discrepancy (MMD) VAE, with an architecture similar to OmiVAE, to analyze ovarian cancer through cancer sample identification, survival analysis, molecular subtype clustering, and classification. The biggest difference between OmiVAE and MMD-VAE is the loss function: MMD-VAE replaces the KL divergence with the maximum mean discrepancy.
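The MMD penalty that replaces the KL term can be sketched with an RBF kernel as follows; the kernel choice, bandwidth, and sample sizes here are illustrative, not MMD-VAE's actual settings.

```python
import numpy as np

rng = np.random.default_rng(6)

def rbf_kernel(a, b, gamma=1.0):
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def mmd2(x, y, gamma=1.0):
    # Biased estimate of the squared MMD between samples x and y.
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

# Latent codes (shifted on purpose) vs. draws from the N(0, I) prior.
z = rng.normal(loc=0.5, size=(64, 8))
prior = rng.standard_normal((64, 8))
penalty = mmd2(z, prior)   # plays the role of the KL term in MMD-VAE
```

Unlike the KL divergence, this penalty only compares samples, so it requires no closed-form density for the encoder distribution.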
Chung et al.[ 63 ] proposed a long short-term memory (LSTM)-based variational autoencoder (LSTM-VAE) for the unsupervised classification of proteins and metabolites during cardiac remodeling in mice. The LSTM layer facilitates the capture of temporal features; compared with the standard VAE, adding an LSTM layer allows temporal multiomics data to be integrated better. The LSTM-VAE is trained using temporal proteomics and metabolomics data to obtain integrated low-dimensional features, and the k-means algorithm is then used to cluster them. In a comparison of five clustering methods, the LSTM-VAE shows the best performance.
AE methods are unsupervised and therefore do not need large amounts of labeled data, which are difficult to obtain in the biomedical field. AE methods can thus be applied to a variety of biomedical tasks and are currently a highly active area of multiomics integration research. Furthermore, AE methods are well suited to high-dimensional omics data, and the richness of the information contained in the output embedding improves their relevance for multiomics integration.[ 26 ] AE methods, with their many variants, are currently the most frequently used methods across a wide range of applications, especially integration scenarios lacking labeled data.
Graph Neural Network for Multiomics Data Integration

In the process of omics data integration, adding association data and topological information between biomedical entities can improve fusion performance. Recently, GNN models have been applied to omics data integration.
The concept of the GNN[ 64 ] was first proposed by Gori et al. in 2005, and the model was elaborated in more detail by Scarselli et al.[ 65 ] This GNN[ 66 ] can process graph-structured data directly, and its core components are the local transfer function and the local output function. The graph convolutional network (GCN)[ 67 ] introduces convolutional operations into the graph structure and is one of the most dominant graph neural networks. GNN methods are mostly used in synergistic drug combination prediction, SL prediction, prognosis, and biomarker identification (Figure 6 ).
Figure 6. Graphic summary of the GNN-based multiomics data integration method. For a clear and concise illustration, one sample from each omics is chosen to demonstrate the components of multiomics data integration. The sample similarity network generated from each omics dataset is used as the input of the GNN. After each omics-specific GNN is trained, the model extracts omics data representations. These representations are concatenated and passed through a predictor for application in many biomedical fields.
Because the biological interactions between different biological entities are complex and systematic, and graph-structured data are more informative, GCNs are also used in drug discovery prediction tasks. Jiang et al.[ 68 ] proposed a cell line-specific GCN model using a three-omics dataset, that is, a drug–drug synergy (DDS) network, a drug–target interaction (DTI) network, and a protein–protein interaction (PPI) network. The model can predict drug–drug synergy in the heterogeneous network. The architecture of the model consists of a GCN encoder and a matrix decoder, and the GCN encoder has four hidden layers. The encoder produces a new representation from the multiomics graph, and the decoder derives the predicted synergy score from that representation. As the first GCN-based model in the drug synergy field, this model showed high accuracy.
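One GCN propagation step, the building block of such encoders, can be sketched as follows on a toy graph (the adjacency matrix, feature sizes, and weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def gcn_layer(a, h, w):
    # One propagation step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).
    a_hat = a + np.eye(a.shape[0])                 # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ a_hat @ d_inv_sqrt @ h @ w, 0.0)

# Toy graph of 4 nodes (e.g., samples in a similarity network).
a = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
h = rng.normal(size=(4, 5))                        # 5 hypothetical features
w = rng.normal(scale=0.5, size=(5, 3))             # learnable weights
h_new = gcn_layer(a, h, w)  # each node now mixes in its neighbours' features
```

Stacking several such layers (four, in the Jiang et al. encoder) lets information propagate across correspondingly longer paths in the network.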
Hao et al.[ 69 ] proposed a supervised multiview graph autoencoder (GAE) for human SL prediction, named SLMGAE. They constructed an SL graph based on the known SL interactions and used a GCN as the encoder in SLMGAE. They treated the SL graph as the main view and graphs constructed from other data sources, such as PPI and GO, as support views. Multiple GAEs are then implemented to reconstruct the graphs from these views. The GCN serves as an encoder for learning gene embeddings, while the decoder reconstructs the graph from the learnt embeddings. Finally, an attentive merging process combines all the reconstructed graphs for SL prediction. This is the first study to date that uses a multiview GNN for SL prediction.
Tang et al.[ 70 ] proposed a multiview multichannel attention graph convolutional network (MMGCN) for miRNA–disease association prediction. MMGCN consists of three parts, that is, a multiview GCN encoder, a multichannel attention mechanism, and a CNN combiner. The GCN encoder uses the multiple similarity graphs of miRNA and disease nodes as input to encode different views. Then, the multichannel attention mechanism focuses on the more important features. Next, the final embedding is obtained by a CNN combiner. After getting the feature embeddings of miRNA and disease, matrix factorization is used to obtain the potential correlation matrix between miRNA and disease. MMGCN is compared with other methods and demonstrates the validity of its multiview design in predicting miRNA–disease associations.
Wang et al.[ 71 ] introduced a novel multiomics integrative method for biomedical classification called multiomics graph convolutional NETworks (MOGONET). The model first constructs a similarity network for each omics data type using cosine similarity. Using each single-omics feature matrix and the corresponding similarity network as input, a GCN is trained for each omics data type to make an initial prediction of the category labels. The initial prediction matrices of all omics are then assembled into a crossomics discovery tensor, which reflects the crossomics label correlation. Finally, the final label prediction is performed using a view correlation discovery network (VCDN), which efficiently integrates the initial predictions of every single omics through potential correlations between different omics data types in the higher-level label space. Across different biomedical classification applications, MOGONET outperforms other state-of-the-art supervised multiomics integrative analysis methods. In addition, this model identified important biomarkers relevant to biomedical studies from different omics data types.
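The first MOGONET step, building a sample-similarity network from one omics matrix via cosine similarity, can be sketched directly. The thresholding scheme below (keep an edge when similarity exceeds a fixed cutoff) is a simplification; the choice of threshold is an assumption for illustration.

```python
import numpy as np

def cosine_similarity_graph(X, threshold=0.5):
    """Build a sample-similarity adjacency from one omics matrix X
    (samples x features) by thresholding pairwise cosine similarity."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T                      # pairwise cosine similarities
    A = (S >= threshold).astype(float) # keep sufficiently similar pairs
    np.fill_diagonal(A, 0.0)           # no self loops
    return A

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 20))      # toy single-omics data, 10 samples
A = cosine_similarity_graph(X)
print(A.shape)
```

Each omics type yields its own adjacency this way, and each (feature matrix, adjacency) pair then feeds a separate GCN classifier.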
GNN is designed to receive graph-structured data as input. Graph-structured data such as similarity networks and biomedical heterogeneous networks can naturally model the complex interactions between biological entities that occur at different scales.[ 72 ] GNN learns node representations by aggregating information from their neighborhoods in the associated networks, hence capturing interactions among biomedical entities better than feature descriptor-based methods. However, how to effectively build a network for each omics data type is a challenge for GNN-based integration. Overall, GNN-based methods are suitable for graph structure-based multiomics integration.
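The neighborhood-aggregation principle behind all of the GNN variants above reduces to a simple operation: each node updates its representation by pooling its neighbors' features with its own. A minimal mean-aggregation round on a toy four-node graph (made-up adjacency and features):

```python
import numpy as np

# toy graph: adjacency list for 4 nodes, and 3-dim input features per node
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
H = np.arange(12, dtype=float).reshape(4, 3)

# one round of mean aggregation: each node averages its neighbors'
# features together with its own (the core GNN message-passing step)
H_new = np.stack([
    np.mean(np.vstack([H[v]] + [H[u] for u in neighbors[v]]), axis=0)
    for v in range(4)
])
print(H_new[0])   # node 0 now mixes information from nodes 1 and 2
```

Stacking several such rounds (with learnable transforms between them) lets information propagate across multi-hop neighborhoods, which is how interactions between distant biomedical entities are captured.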
Capsule Network for Multiomics Data Integration
CNNs are very effective in image classification and recognition, but the internal data representation of CNNs does not take into account the important spatial hierarchical relationships between features. Max pooling discards valuable feature information, and the fundamental problem of the spatial relationship between low-level and high-level features cannot be solved with a traditional CNN. Sabour et al.[ 73 ] proposed CapsNet, based on a capsule system and a dynamic routing mechanism, which achieved extremely high classification accuracy on the MNIST dataset. CapsNet is an artificial intelligence learning architecture that can overcome these shortcomings of CNN. CapsNet-based methods are mostly used in medical imaging-assisted diagnosis and biomarker identification (Figure 7 ).
Figure 7. Graphic summary of the CapsNet-based multiomics data integration method. Multiomics data are concatenated as input of CapsNet. After training the multilayered neural networks, the output is applied in many biomedical fields.
Afshar et al.[ 74 ] proposed a new model, named the 3D multiscale capsule network (3D-MCN), to predict lung tumor malignancy. Three independent CapsNets take 3D nodule crops as input. The multiscale input captures both the local features of the nodule and the characteristics of the surrounding tissues. Each CapsNet takes input at a different scale, and the output vectors are masked and concatenated into a single vector. This vector passes through an integration module, which consists of a set of fully connected layers, and produces the probabilities associated with each category (benign or malignant).
Peng et al.[ 75 ] proposed a DL approach to discover breast cancer-related genes using capsule network-based modeling of multiomics data (CapsNetMMD). This model uses gene expression, z scores for gene expression, DNA methylation, and two forms of copy number variation as input. These multiomics data are fully integrated to generate feature matrices of genes. The gene identification problem is then transformed into a supervised classification task by making use of known breast cancer-related genes. Compared with other ML methods, evaluations on multiple measurements show that CapsNetMMD has the best performance. The results indicate that the capsule network's instantiation parameters and dynamic routing mechanism are suitable for cancer-related gene discovery.
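The dynamic routing mechanism at the heart of CapsNet (Sabour et al.[ 73 ]) can be sketched in NumPy. The prediction tensor `u_hat` below is random rather than produced by a trained network; the capsule counts and dimensions are arbitrary illustration choices.

```python
import numpy as np

def squash(s, eps=1e-9):
    """CapsNet squashing non-linearity: keep the direction of s,
    bound its length strictly below 1."""
    n2 = np.sum(s * s)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, iterations=3):
    """Route predictions u_hat (n_in, n_out, dim) from low-level to
    high-level capsules by iterative routing-by-agreement."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                    # routing logits
    for _ in range(iterations):
        c = np.exp(b - b.max(axis=1, keepdims=True))
        c = c / c.sum(axis=1, keepdims=True)       # coupling coefficients
        s = np.einsum('io,iod->od', c, u_hat)      # weighted sum per output
        v = np.stack([squash(s_j) for s_j in s])   # squashed output capsules
        b = b + np.einsum('iod,od->io', u_hat, v)  # agreement update
    return v

rng = np.random.default_rng(3)
u_hat = rng.standard_normal((8, 2, 4))  # 8 input capsules -> 2 output capsules
v = dynamic_routing(u_hat)
print(v.shape)   # each output capsule's length encodes a class probability
```

Because the squash keeps every capsule length below 1, the length of each output capsule can be read directly as the probability of the corresponding class (e.g., cancer-related vs. unrelated gene).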
A capsule network can learn a good representation from a small amount of data. However, current implementations of capsule networks are less efficient than other DL methods, so improving training efficiency remains a major challenge. At present, the capsule network is suited to the integration of limited labeled data.
Generative Adversarial Network for Multiomics Data Integration
GAN was first proposed by Goodfellow[ 76 ] in 2014. GAN can generate data that do not exist in the real world, giving it a form of artificial intelligence (AI) creativity and imagination. More specifically, a generative model captures the data distribution, and a discriminative model estimates the probability that a sample comes from the training data rather than from the generative model. This method is used in cancer subtype classification and prognosis (Figure 8 ).
Figure 8. Graphic summary of the GAN-based multiomics data integration method. One omics dataset and the two-omics interaction network serve as input of GAN. The generator uses one omics dataset and the interaction network to synthesize the other omics data. Both input and synthetic omics data are passed through a discriminator that differentiates the real and synthetic data. The updated data are passed through a predictor to be applied in many biomedical fields. A represents Omics 1 data. B represents Omics 2 data. The Omics 1–Omics 2 interaction network is a bipartite graph denoted SA. SAT represents the Omics 2–Omics 1 interaction network. hA(k) represents the intermediate value of synthetic Omics 1 data in the kth update. HA(k) represents the synthetic Omics 1 data in the kth update. hB(k) represents the intermediate value of synthetic Omics 2 data in the kth update. HB(k) represents the synthetic Omics 2 data in the kth update. ZA represents the final synthetic Omics 1 data. ZB represents the final synthetic Omics 2 data.
Ahmed et al.[ 77 ] introduced a GAN model called omicsGAN to integrate two omics datasets and their interaction network. This model feeds mRNA expression data, miRNA expression data, and their interaction network as input to Wasserstein GANs (wGANs). Instead of random noise, omicsGAN introduces a stream of information from single-omics data. Furthermore, with appropriate hyperparameters, the GAN is encouraged to retain the information in this stream while forcing the distribution toward the second omics data. This ensures that information from both omics datasets is integrated into the generated samples. Finally, the generated data are used for disease phenotype prediction by a classifier. The model effectively integrates two-omics data and their interaction networks on breast, lung, and ovarian cancer datasets. Compared with the original dataset, the integrated data generated by the model perform better in cancer outcome classification and patient survival prediction.
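The Wasserstein objective underlying wGAN training can be written in a few lines. This toy sketch uses a linear critic and random stand-ins for the real and generated second-omics data; in omicsGAN the generator output would be conditioned on the first omics and the interaction network, which is assumed here to be precomputed into `gen_out`.

```python
import numpy as np

rng = np.random.default_rng(4)

def critic(x, w):
    """Linear critic (a discriminator without a sigmoid, as in WGANs)."""
    return x @ w

# toy stand-ins: 'real' second-omics samples and generator output
real = rng.standard_normal((16, 5)) + 2.0   # real data, shifted distribution
gen_out = rng.standard_normal((16, 5))      # synthetic samples
w = rng.standard_normal(5)                  # critic weights

# WGAN objectives: the critic maximizes the score gap between real and
# synthetic data; the generator minimizes the negated critic score
critic_loss = -(critic(real, w).mean() - critic(gen_out, w).mean())
gen_loss = -critic(gen_out, w).mean()
print(float(critic_loss), float(gen_loss))
```

In practice the critic and generator are deep networks optimized alternately, with a weight-clipping or gradient-penalty constraint keeping the critic approximately 1-Lipschitz; those details are omitted here.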
GAN, as a generative model, takes random noise as input and generates plausible synthetic data similar to a real dataset.[ 77 ] GAN-based methods are therefore promising in missing-value imputation scenarios. Since CapsNet and GAN are emerging approaches with relatively complex architectures, they have only been used for multiomics data integration in the past few years. Their application in a wider range of scenarios needs more exploration.
Conclusion and Discussion
In this article, we reviewed six categories of multiomics data integration methods based on the DL framework. Multiomics data integration methods remove redundant information and extract important low-dimensional feature representations of each omics. No omics data type is independent, and different omics provide separate views on a common biomedical problem. We can capture complementary information from multiomics data, which is beneficial for problem solving in the biomedical field.[ 78 ] We conclude that DL methods for multiomics data integration fuse information from multiple omics to achieve the following applications in biomedicine: 1) clinical prediction (cancer subtype classification, patient stratification, prognosis, and medical imaging-assisted diagnosis); 2) biomarker identification; 3) drug sensitivity prediction; 4) SL prediction; and 5) single cell-related studies.
As mentioned above, DL-based methods have proved to be a promising approach to integrating multiomics data, but their application in biomedical data analysis still faces great challenges. The first challenge is that multiomics data integration is often accompanied by the absence of samples in one or several omics. There are two main solutions to this challenge, namely missing-data imputation and small sample learning. Omics data types are complementary; thus, imputation algorithms can be developed that use the known features of one omics to generate and impute the missing features of another. Also, incorporating more multiomics features to obtain richer interomics information inevitably sacrifices the number of samples, so it is necessary to design integration methods suited to small-scale samples. The second challenge is that the potential of GNN-based methods has not been fully exploited. GNN-based methods achieve competitive performance in a variety of studies.[ 69,79,80 ] Furthermore, GAE, which combines AE and GNN, has achieved success on many tasks.[ 81,82 ] Applying more GNN-based methods to multiomics data integration could be promising and is worthy of further investigation. The last challenge is that most state-of-the-art DL methods lack interpretability, the so-called "black-box" nature. It is difficult to interpret what features a neural network has extracted and learnt because the output of the model cannot be traced back to distinguish these features. Predictions obtained via DL methods are therefore not sufficient to deepen our understanding of the biomedical field. As a major limitation of the DL model, the "black-box" nature limits the ability to provide biological insights into potential biomedical applications.
The question of which omics data or features contribute the most in the process of multiomics data integration remains an issue that needs to be addressed. Although a DL model's internal structure is opaque, there is still a causal relationship between its input and output. Developing methods to transform the "black-box" into a "white-box" will be a popular research topic in DL.[ 83 ]
Acknowledgements
Y.W., L.Z., and D.L. contributed equally to this work. This work was supported by the National Natural Science Foundation of China (
The authors declare no conflict of interest.
Author Contributions
X.B., S.H., and Z.Z. supervised the work. L.Z. and D.L. performed the references preparation. All authors contributed to writing the article. S.H., Y.W., and D.L. made substantial contributions to the discussion of the content of the article.
© 2023. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
The innovation of high-throughput technologies and medical radiomics allows biomedical data to accumulate at an astonishing rate. Several promising deep learning (DL) methods have been developed to integrate multiomics data generated from large numbers of samples. Herein, a comprehensive survey is conducted and the state-of-the-art DL-based multiomics data integration methods in the biomedical field are reviewed. These methods are classified into six categories according to their model framework, and the specific applicable scenarios of each category are summarized across five aspects of biomedicine. DL-based methods offer opportunities for disentangling biomolecular mechanisms in biomedical applications. There are, however, limitations to these methods, such as the missing-data problem and the "black-box" nature. The review ends with a discussion of recommendations for addressing these challenges.
1 Department of Bioinformatics, Institute of Health Service and Transfusion Medicine, Beijing, P. R. China
2 School of Informatics, Xiamen University, Xiamen, P. R. China
3 Department of Bioinformatics, Institute of Health Service and Transfusion Medicine, Beijing, P. R. China; College of Life Science and Technology, Beijing University of Chemical Technology, Beijing, P. R. China
4 Department of Computer Science and Engineering, University of Shanghai for Science and Technology, Shanghai, P. R. China




