This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Genetic variation refers to differences in DNA that make every individual unique. It takes different forms, most of which are well understood, and can involve changes in single DNA nucleotides or in chromosome structure [1, 2]. The human genome is rich in structural variation, of which copy number variation (CNV), a change in the number of copies of a specific region of the genome, is the most common type [3]. In the 1000 Genomes Project data, CNV is referred to as copy number polymorphism (CNP) [4]. CNVs are DNA regions ranging in size from 1 kilobase to several megabases [5]. CNV typically arises from insertion, deletion, and/or duplication of the chemical bases (nucleotides). Some CNVs, called de novo variants, appear for the first time in a parent’s germ cell, while others are inherited [6]. A cell usually carries two copies of each gene; a CNV occurs when part of a gene is deleted or duplicated [7].
Copy number variations affect transcription in humans [8] and have been linked to diseases such as cancer, autism, and schizophrenia [9–11]. Worldwide, cancer is among the most serious threats to human health [12]. Cancer is a class of diseases characterized by abnormal cell growth and is one of the leading causes of human death, accounting for a mortality rate of about 14.6% each year [13]. CNVs may also underlie phenotypic variation [6, 14]. CNV data can be used to classify tumors as malignant or benign [15, 16], and a number of studies agree that somatic CNVs are closely associated with the progression of various cancers [17–20].
Machine learning practitioners have proposed many techniques to identify one or more types of cancer from various kinds of genomic data, each with its own strengths and weaknesses. Colonoscopy screening is widely used to evaluate colorectal cancer (CRC) risk, but its discomfort and complexity have motivated the search for more reliable and comfortable CRC screening methods. Ding et al. [21] present a comprehensive study of machine learning applications in CNV-based cancer prediction.
Dealing with high-dimensional and heterogeneous data remains a key challenge in healthcare [22]. Traditional machine learning methods first perform feature extraction and selection to obtain more informative features and then build prediction models on them. Advances in deep learning provide effective approaches for building end-to-end models. Deep learning has become a popular toolbox for big data [23, 24], especially in genomics, owing to its performance on prediction problems; it has been used for tasks such as predicting DNA sequence conservation, identifying enhancers and promoters, and detecting genetic variation from DNA sequencing. These advances and fruitful applications across genomics suggest that deep learning can also be used for cancer classification from CNV data [22, 25–27].
Different computational models for cancer classification based on copy number variation data are available; the most recently developed model achieves an accuracy of up to 85%. CNV data are high dimensional in nature and difficult to handle with classical machine learning techniques. In this study, we implemented deep learning models that use the CNV levels of 24,174 genes to classify six types of cancer: breast adenocarcinoma (BRCA), urothelial bladder carcinoma (BLCA), colon and rectal carcinoma (COAD/READ), glioblastoma multiforme (GBM), kidney renal clear cell carcinoma (KIRC), and head and neck squamous cell carcinoma (HNSC). The highest average training accuracy obtained is 96%, with a testing accuracy of 92%. We propose three different deep learning architectures, all of which outperform state-of-the-art techniques in terms of accuracy, ROC, and precision, while two of our networks also outperform the state of the art in terms of recall (see Table 1). The contribution of this work is therefore twofold: to improve the accuracy of the cancer classifier using an end-to-end model, and to determine which architecture among DNN (deep fully connected neural network), CNN, and RNN is best suited to CNV data. According to our findings, the DNN performs better than the other two.
Table 1
The average performances of different models along with the state of the art.
S. no | Models | Train Acc | Val Acc (%) | ROC area | Precision | Recall |
1 | NN (shallow) | 95% | 91 | 0.99 | 0.88 | 0.87 |
2 | DNN | 96% | 92 | 0.99 | 0.89 | 0.88 |
3 | LSTM | 95% | 91 | 0.98 | 0.89 | 0.85 |
4 | 1D-CNN | 88% | 90 | 0.98 | 0.88 | 0.85 |
5 | Fekry et al. [38] | — | 85.9 | 0.965 | 0.852 | 0.862 |
Section 2 discusses the related literature, while Section 3 describes the dataset and the architectures of our models. Section 4 covers the training process, the obtained results, and our findings. Finally, Section 5 concludes the work.
2. Related Work
Xu et al. [28] identified chromosomal alterations in plasma for early detection of CRC. They analyzed CNVs in cfDNA (cell-free DNA) using the regular z score and trained an SVM classifier to identify colon and rectal cancers; patients in the two early stages (I and II) were detected. Toh and Brody [29] used blood samples from 8,821 patients. For features, they extracted germline DNA copy number variation data in a single laboratory with an SNP 6.0 array, and a gradient boosting algorithm was used to predict breast, ovarian, brain, and colon cancers. Ricatto et al. [30] used a discretizer for feature extraction and a fuzzy rule-based predictor for tumor classification.
In women, breast cancer is the most common type of cancer and has several subtypes [31]. Pan et al. [32] carried out feature extraction and selection using MCFS (Monte Carlo feature selection). IFS (incremental feature selection) is used to better represent the core CNVs in different subtypes of breast cancer, and a dagging model is then integrated to detect multiple types of breast cancer. Islam et al. [33] focused on predicting molecular subtypes of breast cancer. They performed experiments to identify binary classes, i.e., estrogen receptor status (ER+ and ER−), and multiple classes, i.e., PAM50 (luminal A, luminal B, Her2 enriched, and basal-like). They then applied the chi-square test to select the most significant genes and used a DCNN (deep convolutional neural network) for classification. Lu et al. [34] also focused on the classification of breast cancer: the authors introduced a module-based network integrated with genomic data to identify important driver genes in BRCA subtypes. Li et al. [35] performed a CNV analysis of tumor development. Their use case was breast cancer, with data collected from the TCGA-BRCA project; they searched OMIM (Online Mendelian Inheritance in Man) for the most relevant CNVs and chose six candidate genes: ErbB2, AKT2, KRAS, PIK3CA, PTEN, and CCND1. They then constructed two types of distance-based oncogenetic trees to find which of these candidate genes play a significant role in the development of breast cancer. Their findings showed that ErbB2 is altered early, while AKT2, KRAS, PIK3CA, PTEN, and CCND1 are altered late in human breast cancer. AlShibli et al. [36] proposed deep convolutional neural networks to classify six types of cancer from CNV data, borrowing well-known computer vision architectures, i.e., ResNet16 and VGG16. Their average accuracy is 86%, and they reported that their model performs worst on UCEC (uterine corpus endometrial carcinoma).
To understand the association of CNVs with various types of human cancer, Zhang et al. [37] collected CNV data for different cancer classes with 24,174 genes as features. Feature selection was carried out using minimal redundancy maximal relevance (mRMR) and incremental feature selection (IFS), which resulted in the selection of 200 genes; a dagging model was used for the classification of multiple cancer types. Fekry et al. [38] also worked on the CNV levels of these 24,174 genes to classify a set of human cancer types, namely breast adenocarcinoma (BRCA), urothelial bladder carcinoma (BLCA), colon and rectal carcinoma (COAD/READ), glioblastoma multiforme (GBM), kidney renal clear cell carcinoma (KIRC), and head and neck squamous cell carcinoma (HNSC). They selected 16,381 important genes by CNV level using a filter method (information gain) and applied seven different classifiers: support vector machine, J48, neural network, random forest, logistic regression, dagging, and bagging. The authors in [39] contributed to cancer classification using a self-normalizing neural network with Monte Carlo feature selection and incremental feature selection (IFS); they worked on multiple cancer types and obtained 79% accuracy.
More recently, researchers have been combining CNV data with other modalities, such as clinical and/or gene expression data, to improve the performance of their models. The researchers in [40] used multimodality data to classify subtypes of breast cancer with the help of an SVM (support vector machine) and RF (random forest). Deep learning models on multimodality data are used to predict breast cancer subtypes in [41, 42], and another multimodal deep learning model is used in [43] to predict Alzheimer’s disease. The researchers in [44] trained their deep learning model on multiple modalities to predict therapeutic targets in breast cancer. A comprehensive comparison of multimodal approaches is presented in [45].
3. Materials and Methods
3.1. Dataset
For experimentation, we selected the same dataset used by [38] so that results remain directly comparable. The dataset comprises six cancer types and contains the DNA CNVs of 24,174 genes (features/dimensions) for 2,916 samples; the dataset therefore has shape 2,916 × 24,174. The distribution of samples across cancer types is given in Table 2.
Table 2
The distribution of samples with respect to each cancer type in our dataset.
Sr. | Cancer type | No of samples |
0 | BRCA (breast carcinoma) | 847 |
1 | BLCA (bladder urothelial) | 135 |
2 | COAD/READ (colon and rectal adenocarcinoma) | 575 |
3 | GBM (glioblastoma multiforme) | 563 |
4 | KIRC (kidney renal cell carcinoma) | 306 |
5 | HNSC (head and neck squamous cell) | 490 |
— | Total | 2916 |
3.2. Our Proposed Models
3.2.1. DNN (Deep Fully Connected Neural Network)
An artificial neural network (ANN) is a powerful computational tool that mimics the working of the human brain [46]. A neural network (NN) consists of a set of neurons arranged in layers, namely the input, hidden, and output layers. A single neuron takes an input vector, computes a weighted sum, and applies an activation function to decide whether it should fire. In a fully connected neural network, every neuron of one layer is connected to all neurons of the next layer.
For a network of
To speed up the network convergence [47], we have used the batch normalization that scales the
Algorithm 1: Batch normalization.
Input:
Return
In Algorithm 1, the parameters
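The formulas of Algorithm 1 are not reproduced in this text, so the following NumPy sketch restates the standard batch normalization transform of ref. [47] (mini-batch mean and variance, normalization, then the learnable scale gamma and shift beta); it is an illustration, not necessarily the authors' exact implementation:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over a mini-batch (Ioffe & Szegedy, ref. [47]).
    x: (batch, features); gamma, beta: learnable scale and shift."""
    mu = x.mean(axis=0)                     # mini-batch mean
    var = x.var(axis=0)                     # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    return gamma * x_hat + beta             # scale and shift

# Toy mini-batch with arbitrary mean and spread
x = np.random.default_rng(0).normal(3.0, 5.0, size=(32, 4))
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# with gamma = 1 and beta = 0, columns of y have ~zero mean and ~unit variance
```

With gamma and beta left at their identity values, the layer only whitens its input; during training these two vectors are learned so the network can undo the normalization where that helps.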
The ReLU expedites training and avoids the vanishing gradient problem [49]. The last layer in the network is the output layer (classification layer), which gives the probability of occurrence of each class via the softmax function (equation (4)).
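As a concrete illustration of the classification layer, a minimal NumPy softmax can be sketched as follows (the score values are made up):

```python
import numpy as np

def softmax(z):
    """Convert raw class scores into probabilities that sum to 1."""
    z = z - z.max()      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # hypothetical scores for 3 classes
probs = softmax(scores)              # ~[0.66, 0.24, 0.10]
```

The predicted class is simply the index of the largest probability.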
In the deep fully connected neural network (DNN) category, we have implemented the networks from shallow to deep by increasing hidden layers one by one. Furthermore, the number of neurons is reduced with a factor of
[figure(s) omitted; refer to PDF]
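Since the figures describing the exact DNN topology are omitted here, the forward pass of such a fully connected classifier can be sketched in plain NumPy. The hidden widths (128 and 64) and weight scale below are hypothetical, as the paper's exact layer sizes and reduction factor are not reproduced in this text; only the input size (24,174 genes) and the six-class softmax output come from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def dense(x, w, b):
    return x @ w + b

# Hypothetical layer widths, shrinking toward the 6 output classes
sizes = [24174, 128, 64, 6]
params = [(rng.normal(0, 0.01, (m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x):
    for w, b in params[:-1]:
        x = relu(dense(x, w, b))              # hidden layers: ReLU
    z = dense(x, *params[-1])                 # output layer: class scores
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)   # softmax probabilities

probs = forward(rng.normal(size=(2, 24174)))  # two dummy CNV profiles
```

Each row of `probs` is a distribution over the six cancer classes.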
3.2.2. 1D Convolutional Neural Network
We have also used the 1D
[figure(s) omitted; refer to PDF]
Note that the first convolutional layer contains 20 filters, each of size 5, with ReLU as the activation function. Similarly, the second convolutional layer consists of a stack of 10 filters, each of size 5, again with ReLU. For the activation function in the output layer, we used softmax (see equation (4)).
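The two convolutional layers described above (20 filters of size 5, then 10 filters of size 5, both with ReLU) can be sketched with a plain NumPy valid 1-D convolution. The input length is shortened here for speed (a full CNV profile would have 24,174 positions) and the weights are random placeholders:

```python
import numpy as np

def conv1d(x, kernels, stride=1):
    """Valid 1-D convolution. x: (length, in_ch); kernels: (n_filters, k, in_ch)."""
    n_f, k, _ = kernels.shape
    out_len = (x.shape[0] - k) // stride + 1
    out = np.empty((out_len, n_f))
    for i in range(out_len):
        window = x[i * stride : i * stride + k]                    # (k, in_ch)
        out[i] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return out

rng = np.random.default_rng(1)
x = rng.normal(size=(1000, 1))   # shortened CNV profile as a 1-channel signal
# layer 1: 20 filters of size 5, ReLU
h1 = np.maximum(0, conv1d(x, rng.normal(0, 0.1, (20, 5, 1))))
# layer 2: 10 filters of size 5, ReLU
h2 = np.maximum(0, conv1d(h1, rng.normal(0, 0.1, (10, 5, 20))))
```

Each filter slides along the gene axis, so the feature maps shrink by 4 positions per layer (valid convolution with kernel size 5).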
3.2.3. LSTM (Long Short-Term Memory)
LSTM is one of the popular flavors of the RNN (recurrent neural network) with three special gates, i.e., the input/update, forget, and output gates, as shown in Figure 3. The key gate is the forget gate, which keeps long-term dependencies intact; it is this preservation of long-term dependencies that makes LSTM suitable for sequential data analysis [53].
[figure(s) omitted; refer to PDF]
In our proposed model, we used 24 LSTM units with ReLU as the activation function, followed by a batch normalization layer and then the output layer.
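A single step of the LSTM recurrence can be sketched as below. The 24 units match the model description, but the input dimensionality and weights are hypothetical placeholders; note that a standard LSTM cell uses sigmoid/tanh gates internally, so the ReLU mentioned above would apply to the layer's output rather than inside the cell:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b, n):
    """One LSTM step; gate pre-activations stacked as [input, forget, cell, output]."""
    z = W @ x + U @ h + b
    i, f = sigmoid(z[:n]), sigmoid(z[n:2 * n])
    g, o = np.tanh(z[2 * n:3 * n]), sigmoid(z[3 * n:])
    c = f * c + i * g          # forget gate preserves long-term state
    h = o * np.tanh(c)         # output gate produces the hidden state
    return h, c

n, d = 24, 8                   # 24 LSTM units; hypothetical input size 8
rng = np.random.default_rng(2)
W = rng.normal(0, 0.1, (4 * n, d))
U = rng.normal(0, 0.1, (4 * n, n))
b = np.zeros(4 * n)

h, c = np.zeros(n), np.zeros(n)
for t in range(5):             # run the cell over a short input sequence
    h, c = lstm_step(rng.normal(size=d), h, c, W, U, b, n)
```

The cell state `c` is what carries long-range information; the forget gate `f` decides, element-wise, how much of it to keep at each step.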
4. Results and Discussion
The dataset was split into 80% training and 20% testing to examine the performance of our proposed models; the adopted methodology is shown in Figure 4. The testing and validation sets are the same, which is why the validation and testing metrics coincide. Representation learning is implicit in the model(s), and its value in deep learning has been established in the literature. As mentioned in Section 3.2, we implemented three different neural network architectures to explore their strengths and weaknesses, starting from a shallow neural network and moving to the deep fully connected NN (DNN), then LSTM, and finally the 1D-CNN.
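The 80/20 split described above can be sketched as follows (placeholder arrays; the real feature matrix has 24,174 columns):

```python
import numpy as np

def split_80_20(X, y, seed=0):
    """Shuffle and split into 80% train / 20% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(0.8 * len(X))
    return X[idx[:cut]], X[idx[cut:]], y[idx[:cut]], y[idx[cut:]]

X = np.zeros((2916, 10))       # placeholder features (real data: 24,174 genes)
y = np.zeros(2916, dtype=int)  # placeholder labels for the 6 classes
X_tr, X_te, y_tr, y_te = split_80_20(X, y)
# 2,916 samples -> 2,332 for training, 584 for testing
```

Shuffling before the cut keeps the class mixture of the two partitions roughly similar; a stratified split would enforce it exactly.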
[figure(s) omitted; refer to PDF]
We trained our models for up to 200 epochs and plotted the results to check the training status, that is, whether a model is underfitted, overfitted, or properly trained.
The obtained training vs. validation accuracies of each model are shown in Figure 5. Given the results in Figure 5, our shallow NN
[figure(s) omitted; refer to PDF]
A classwise ROC is shown in Figure 6. The highest ROC, i.e., 1.0, is achieved by all networks for the COAD/READ class, while the highest average ROC of 0.99 is achieved by
[figure(s) omitted; refer to PDF]
To test the performance of our networks on each class (cancer type), we present the computed results in Table 3. According to these results, the GBM class is the most difficult one for our networks, while COAD/READ is the easiest. The same can be verified from the confusion matrices given in Tables 4 and 5.
Table 3
The classwise performances of all networks.
Models | GBM (3) | KIRC (4) | HNSC (5) | COAD/READ (2) | BLCA (1) | BRCA (0) |
NN | ||||||
TP rate | 0.68 | 0.96 | 0.82 | 0.98 | 0.83 | 0.93 |
ROC area | 0.97 | 0.99 | 0.97 | 1.00 | 0.98 | 0.99 |
Precision | 0.77 | 0.90 | 0.92 | 0.93 | 0.81 | 0.97 |
F-measure | 0.72 | 0.93 | 0.87 | 0.96 | 0.82 | 0.95 |
Recall | 0.68 | 0.96 | 0.82 | 0.98 | 0.83 | 0.93 |
FP rate | 0.00 | 0.01 | 0.04 | 0.01 | 0.02 | 0.00 |
DNN | ||||||
TP rate | 0.72 | 0.96 | 0.85 | 0.98 | 0.85 | 0.94 |
ROC area | 0.97 | 0.98 | 0.99 | 1.00 | 0.98 | 0.99 |
Precision | 0.75 | 0.93 | 0.94 | 0.94 | 0.85 | 0.93 |
F-measure | 0.73 | 0.94 | 0.89 | 0.96 | 0.85 | 0.94 |
Recall | 0.72 | 0.96 | 0.85 | 0.98 | 0.85 | 0.94 |
FP rate | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.01 |
LSTM | ||||||
TP rate | 0.52 | 0.95 | 0.85 | 0.98 | 0.88 | 0.92 |
ROC area | 0.96 | 0.99 | 0.98 | 1.00 | 0.97 | 1.00 |
Precision | 0.87 | 0.91 | 0.93 | 0.92 | 0.79 | 0.95 |
F-measure | 0.65 | 0.93 | 0.88 | 0.95 | 0.83 | 0.94 |
Recall | 0.68 | 0.94 | 0.84 | 0.96 | 0.79 | 0.91 |
FP rate | 0.52 | 0.95 | 0.85 | 0.98 | 0.88 | 0.92 |
1D-CNN | ||||||
TP rate | 0.64 | 0.93 | 0.92 | 0.96 | 0.77 | 0.91 |
ROC area | 0.97 | 0.99 | 0.97 | 1.00 | 0.97 | 0.99 |
Precision | 0.84 | 0.93 | 0.81 | 0.93 | 0.86 | 0.94 |
F-measure | 0.73 | 0.93 | 0.86 | 0.94 | 0.82 | 0.92 |
Recall | 0.64 | 0.93 | 0.92 | 0.96 | 0.77 | 0.91 |
FP rate | 0.00 | 0.02 | 0.04 | 0.01 | 0.01 | 0.01 |
Table 4
Confusion matrix for training data.
BRCA(0) | BLCA(1) | COAD/READ(2) | GBM (3) | KIRC(4) | HNSC(5) | |
BRCA(0) | 109 | 0 | 1 | 0 | 0 | 0 |
BLCA(1) | 0 | 673 | 6 | 0 | 0 | 0 |
COAD/READ(2) | 0 | 1 | 470 | 0 | 0 | 0 |
GBM (3) | 0 | 0 | 2 | 446 | 0 | 0 |
KIRC(4) | 0 | 0 | 7 | 0 | 233 | 0 |
HNSC(5) | 0 | 0 | 3 | 0 | 0 | 381 |
Table 5
The confusion matrix for testing data.
BRCA(0) | BLCA(1) | COAD/READ(2) | GBM (3) | KIRC(4) | HNSC(5) | |
BRCA(0) | 15 | 1 | 2 | 2 | 3 | 2 |
BLCA(1) | 0 | 158 | 3 | 3 | 2 | 2 |
COAD/READ(2) | 0 | 3 | 94 | 1 | 4 | 2 |
GBM (3) | 0 | 0 | 1 | 113 | 1 | 0 |
KIRC(4) | 1 | 5 | 1 | 0 | 55 | 4 |
HNSC(5) | 1 | 0 | 3 | 1 | 1 | 100 |
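The reported metrics can be recomputed directly from the testing confusion matrix in Table 5 (rows taken as actual classes, columns as predicted): the diagonal sums to 535 of 584 test samples, consistent with the roughly 92% testing accuracy in Table 1:

```python
import numpy as np

# Testing confusion matrix from Table 5 (rows: actual, cols: predicted)
cm = np.array([
    [15,   1,  2,   2,  3,   2],
    [ 0, 158,  3,   3,  2,   2],
    [ 0,   3, 94,   1,  4,   2],
    [ 0,   0,  1, 113,  1,   0],
    [ 1,   5,  1,   0, 55,   4],
    [ 1,   0,  3,   1,  1, 100],
])

accuracy = cm.diagonal().sum() / cm.sum()   # correct predictions / all samples
recall = cm.diagonal() / cm.sum(axis=1)     # per-class recall (TP rate)
precision = cm.diagonal() / cm.sum(axis=0)  # per-class precision
```

Computing the metrics this way makes the per-class numbers in Table 3 reproducible from the raw counts.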
The average performance measures (in terms of accuracy, precision, recall, and ROC) of all networks are shown in the first four rows of Table 1. The obtained results show that our DNN architecture has outperformed the rest of our models.
We compared our computed results with state-of-the-art models. As shown in Table 1, all of our networks outperform the competitors on most performance metrics. We report only the best results of Fekry et al. [38]: their maximum accuracy is 85.9% with an ROC area of 0.965, whereas our best proposed model achieves 92% accuracy with an ROC of 0.99.
Zhang et al. [37] carried out similar work, but their research deals with partly different cancer types, e.g., UCEC (uterine corpus endometrial carcinoma); the comparison is therefore not directly compatible. They achieved 75.1% accuracy.
In light of the analysis of the obtained results, we conclude that, owing to the small size of the current dataset, very deep neural networks are not beneficial: most of our models converge with a small number of hidden layers. Moreover, the fully connected neural network performed better than the other flavors, such as CNN and RNN, on copy number variation (CNV) data (see Table 1). We also found that adding further layers to the fully connected network (DNN) has only a small impact on the results. Finally, our results confirm that end-to-end deep learning models are better at representation learning than handcrafted feature extraction (see Table 1).
5. Conclusion and Future Directions
Copy number variations are related to different human diseases, such as cancer, autism, and schizophrenia. In this paper, we classified six different types of cancer using copy number variation data. We proposed three different neural network architectures to make the classification process end-to-end. Moreover, we exploited the data-hungry nature of deep neural networks and skipped the feature engineering (handcrafted feature extraction) step used by most researchers, saving computational time. Using the CNV levels of 24,174 genes, our models achieved testing accuracies of 91%, 92%, 90%, and 91%. Our work testifies that the CNVs of these genes play a crucial role in classifying human cancers. In the future, we aim to work on other types of cancer as well.
Acknowledgments
This work was supported by EIAS (Emerging Intelligent Autonomous Systems) Data Science Lab, Prince Sultan University, KSA. The authors would like to thank the EIAS Data Science Lab and Prince Sultan University for their encouragement, support, and the facilitation of resources needed and funding to complete this work.
[1] A. Thapar, M. Cooper, "Copy number variation: what is it and what has it told us about child psychiatric disorders?," Journal of the American Academy of Child & Adolescent Psychiatry, vol. 52 no. 8, pp. 772-774, DOI: 10.1016/j.jaac.2013.05.013, 2013.
[2] N. M. Williams, I. Zaharieva, A. Martin, K. Langley, K. Mantripragada, R. Fossdal, H. Stefansson, K. Stefansson, P. Magnusson, O. Gudmundsson, O. Gustafsson, P. Holmans, M. J Owen, M. O’Donovan, A. Thapar, "Rare chromosomal deletions and duplications in attention-deficit hyperactivity disorder: a genome-wide analysis," The Lancet, vol. 376 no. 9750, pp. 1401-1408, DOI: 10.1016/s0140-6736(10)61109-9, 2010.
[3] T. Y. Leung, R. K. Pooh, C. C. Wang, T. K. Lau, K. W. Choy, "Classification of pathogenic or benign status of CNVS detected by microarray analysis," Expert Review of Molecular Diagnostics, vol. 10 no. 6, pp. 717-721, DOI: 10.1586/erm.10.68, 2010.
[4] Z. Zhang, H. Cheng, X. Hong, A. F. Di Narzo, O. Franzen, S. Peng, A. Ruusalepp, J. C. Kovacic, J. L. M. Bjorkegren, X. Wang, K. Hao, "Ensemblecnv: an ensemble machine learning algorithm to identify and genotype copy number variation using snp array data," Nucleic Acids Research, vol. 47 no. 7,DOI: 10.1093/nar/gkz068, 2019.
[5] J. Zhang, L. Feuk, G. Duggan, R. Khaja, S. Scherer, "Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome," Cytogenetic and Genome Research, vol. 115 no. 3-4, pp. 205-214, DOI: 10.1159/000095916, 2006.
[6] I. Ostrovnaya, G. Nanjangud, A. B. Olshen, "A classification model for distinguishing copy number variants from cancer-related alterations," BMC Bioinformatics, vol. 11 no. 1,DOI: 10.1186/1471-2105-11-297, 2010.
[7] P. Stankiewicz, J. R. Lupski, "Structural variation in the human genome and its role in disease," Annual Review of Medicine, vol. 61 no. 1, pp. 437-455, DOI: 10.1146/annurev-med-100708-204735, 2010.
[8] C. Chiang, A. J. Scott, J. R. Davis, E. K. Tsang, X. Li, Y. Kim, T. Hadzic, F. N. Damani, L. Ganel, S. B. Montgomery, A. Battle, D. F. Conrad, I. M. Hall, "The impact of structural variation on human gene expression," Nature Genetics, vol. 49 no. 5, pp. 692-699, DOI: 10.1038/ng.3834, 2017.
[9] G. A. Erikson, N. Deshpande, B. G. Kesavan, A. Torkamani, "Sg-adviser cnv: copy-number variant annotation and interpretation," Genetics in Medicine, vol. 17 no. 9, pp. 714-718, DOI: 10.1038/gim.2014.180, 2015.
[10] O. Pös, J. Radvanszky, G. Buglyó, Z. Pös, D. Rusnakova, B. Nagy, T. Szemes, "Dna copy number variation: main characteristics, evolutionary significance, and pathological aspects," Biomedical Journal, vol. 44 no. 5, pp. 548-559, DOI: 10.1016/j.bj.2021.02.003, 2021.
[11] E. Sarkar, E. Chielle, G. Gursoy, L. Chen, M. Gerstein, M. Maniatakos, "Scalable privacy-preserving cancer type prediction with homomorphic encryption," 2022. http://arXiv.org/abs/2204.05496
[12] Y. Sun, S. Zhu, K. Ma, W. Liu, Y. Yue, G. Hu, H. Lu, W. Chen, "Identification of 12 cancer types through genome deep learning," Scientific Reports, vol. 9 no. 1,DOI: 10.1038/s41598-019-53989-3, 2019.
[13] Y. Yuan, Y. Shi, X. Su, X. Zou, Q. Luo, D. D. Feng, W. Cai, Z.-G. Han, "Cancer type prediction based on copy number aberration and chromatin 3d structure with convolutional neural networks," BMC Genomics, vol. 19 no. S6,DOI: 10.1186/s12864-018-4919-z, 2018.
[14] C. A. Brownstein, R. S. Smith, L. H. Rodan, M. P. Gorman, M. A. Hojlo, E. A. Garvey, J. Li, K. Cabral, J. J. Bowen, A. S. Rao, C. A. Genetti, D. Carroll, E. A. Deaso, P. B. Agrawal, J. A. Rosenfeld, W. Bi, J. Howe, D. J. Stavropoulos, A. W. Hansen, H. M. Hamoda, F. Pinard, A. Caracansi, E. J. D’Angelo, A. H. Beggs, M. Zarrei, R. A. Gibbs, S. W. Scherer, D. C. Glahn, J. Gonzalez-Heydrich, "Rcl1 copy number variants are associated with a range of neuropsychiatric phenotypes," Molecular Psychiatry, vol. 26 no. 5, pp. 1706-1718, DOI: 10.1038/s41380-021-01035-y, 2021.
[15] A. Mahas, K. Potluri, M. N. Kent, S. Naik, M. Markey, "Copy number variation in archival melanoma biopsies versus benign melanocytic lesions," Cancer Biomarkers, vol. 16 no. 4, pp. 575-597, DOI: 10.3233/cbm-160600, 2016.
[16] C. F. Ebbelaar, A. M. R. Schrader, M. van Dijk, R. W. J. Meijers, W. W. J. de Leng, L. T. Bloem, A. M. L. Jansen, W. A. M. Blokx, "Towards diagnostic criteria for malignant deep penetrating melanocytic tumors using single nucleotide polymorphism array and next-generation sequencing," Modern Pathology, vol. 2021,DOI: 10.1038/s41379-022-01026-6, 2022.
[17] L. Yang, Y. Z. Wang, H. H. Zhu, Y. Chang, L. D. Li, W. M. Chen, L. Y. Long, Y. H. Zhang, Y. R. Liu, J. Lu, Y. Z. Qin, "Prame gene copy number variation is related to its expression in multiple myeloma," DNA and Cell Biology, vol. 36 no. 12, pp. 1099-1107, DOI: 10.1089/dna.2017.3951, 2017.
[18] Y. S. Huang, W. B. Liu, F. Han, J. T. Yang, X. L. Hao, H. Q. Chen, X. Jiang, L. Yin, L. Ao, Z. H. Cui, J. Cao, J. Y. Liu, "Copy number variations and expression of mpdz are prognostic biomarkers for clear cell renal cell carcinoma," Oncotarget, vol. 8 no. 45, pp. 78713-78725, DOI: 10.18632/oncotarget.20220, 2017.
[19] C. Zhou, W. Zhang, W. Chen, Y. Yin, M. Atyah, S. Liu, L. Guo, Y. Shi, Q. Ye, Q. Dong, N. Ren, "Integrated analysis of copy number variations and gene expression profiling in hepatocellular carcinoma," Scientific Reports, vol. 7 no. 1,DOI: 10.1038/s41598-017-11029-y, 2017.
[20] J. Samulin, Y. J. Arnoldussen, Y Erdem, Y. Arnoldussen, V. Skaug, A. Haugen, S. Zienolddiny, "Copy number variation, increased gene expression, and molecular mechanisms of neurofascin in lung cancer," Molecular Carcinogenesis, vol. 56 no. 9, pp. 2076-2085, DOI: 10.1002/mc.22664, 2017.
[21] X. Ding, S. Y. Tsang, S. K. Ng, H. Xue, "Application of machine learning to development of copy number variation-based prediction of cancer risk," Genomics Insights, vol. 7, pp. GEI.S15002-11, DOI: 10.4137/gei.s15002, 2014.
[22] R. Miotto, F. Wang, S. Wang, X. Jiang, J. T. Dudley, "Deep learning for healthcare: review, opportunities and challenges," Briefings in Bioinformatics, vol. 19 no. 6, pp. 1236-1246, DOI: 10.1093/bib/bbx044, 2018.
[23] B. Jan, H. Farman, M. Khan, M. Imran, I. U. Islam, A. Ahmad, S. Ali, G. Jeon, "Deep learning in big data analytics: a comparative study," Computers & Electrical Engineering, vol. 75, pp. 275-287, DOI: 10.1016/j.compeleceng.2017.12.009, 2019.
[24] M. Khan, B. Jan, H. Farman, Deep Learning: Convergence to Big Data Analytics,DOI: 10.1007/978-981-13-3459-7, 2019.
[25] C. Angermueller, T. Pärnamaa, L. Parts, O. Stegle, "Deep learning for computational biology," Molecular Systems Biology, vol. 12 no. 7,DOI: 10.15252/msb.20156651, 2016.
[26] Y. Hu, L. Zhao, Z. Li, X. Dong, T. Xu, Y. Zhao, "Classifying the multi-omics data of gastric cancer using a deep feature selection method," Expert Systems with Applications, vol. 200,DOI: 10.1016/j.eswa.2022.116813, 2022.
[27] D. Khan, S. Shedole, "Leveraging deep learning techniques and integrated omics data for tailored treatment of breast cancer," Journal of Personalized Medicine, vol. 12 no. 5,DOI: 10.3390/jpm12050674, 2022.
[28] J.-F. Xu, Q. Kang, X.-Y. Ma, Y.-M. Pan, L. Yang, P. Jin, X. Wang, C.-G. Li, X.-C. Chen, C. Wu, S. Z. Jiao, J. Q. Sheng, "A novel method to detect early colorectal cancer based on chromosome copy number variation in plasma," Cellular Physiology and Biochemistry, vol. 45 no. 4, pp. 1444-1454, DOI: 10.1159/000487571, 2018.
[29] C. Toh, J. P. Brody, "Analysis of copy number variation from germline dna can predict individual cancer risk," bioRxiv, 2018.
[30] M. Ricatto, M. Barsacchi, A. Bechini, "Interpretable cnv-based tumour classification using fuzzy rule based classifiers," Proceedings of the 33rd Annual ACM Symposium on Applied Computing, pp. 54-59, DOI: 10.1145/3167132.3167135.
[31] A. Szymiczek, A. Lone, M. R. Akbari, "Molecular intrinsic versus clinical subtyping in breast cancer: a comprehensive review," Clinical Genetics, vol. 99 no. 5, pp. 613-637, DOI: 10.1111/cge.13900, 2021.
[32] X. Pan, X. Hu, Y.-H. Zhang, L. Chen, L. Zhu, S. Wan, T. Huang, Y.-D. Cai, "Identification of the copy number variant biomarkers for breast cancer subtypes," Molecular Genetics and Genomics, vol. 294 no. 1, pp. 95-110, DOI: 10.1007/s00438-018-1488-4, 2019.
[33] M. M. Islam, R. Ajwad, C. Chi, M. Domaratzki, Y. Wang, P. Hu, "Somatic copy number alteration-based prediction of molecular subtypes of breast cancer using deep learning model," Canadian Conference on Artificial Intelligence, vol. 10233, pp. 57-63, DOI: 10.1007/978-3-319-57351-9_7, 2017.
[34] X. Lu, X. Li, P. Liu, X. Qian, Q. Miao, S. Peng, "The integrative method based on the module-network for identifying driver genes in cancer subtypes," Molecules, vol. 23 no. 2,DOI: 10.3390/molecules23020183, 2018.
[35] X.-C. Li, C. Liu, T. Huang, Y. Zhong, "The occurrence of genetic alterations during the progression of breast carcinoma," BioMed Research International, vol. 2016,DOI: 10.1155/2016/5237827, 2016.
[36] A. AlShibli, H. Mathkour, "A shallow convolutional learning network for classification of cancers based on copy number variations," Sensors, vol. 19 no. 19,DOI: 10.3390/s19194207, 2019.
[37] N. Zhang, M. Wang, P. Zhang, T. Huang, "Classification of cancers based on copy number variation landscapes," Biochimica et Biophysica Acta (BBA) - General Subjects, vol. 1860 no. 11, pp. 2750-2755, DOI: 10.1016/j.bbagen.2016.06.003, 2016.
[38] S. F. A. Elsadek, M. A. A. Makhlouf, M. A. Aldeen, "Supervised classification of cancers based on copy number variation," Advances in Intelligent Systems and Computing, vol. 845, pp. 198-207, DOI: 10.1007/978-3-319-99010-1_18, 2018.
[39] J. Li, Q. Xu, M. Wu, T. Huang, Y. Wang, "Pan-cancer classification based on self-normalizing neural networks and feature selection," Frontiers in Bioengineering and Biotechnology, vol. 8,DOI: 10.3389/fbioe.2020.00766, 2020.
[40] A. El-Nabawy, N. A. Belal, "A feature-fusion framework of clinical, genomics, and histopathological data for metabric breast cancer subtype classification," Applied Soft Computing, vol. 91,DOI: 10.1016/j.asoc.2020.106238, 2020.
[41] Y. Lin, W. Zhang, H. Cao, G. Li, W. Du, "Classifying breast cancer subtypes using deep neural networks based on multi-omics data," Genes, vol. 11 no. 8,DOI: 10.3390/genes11080888, 2020.
[42] T. Liu, J. Huang, T. Liao, R. Pu, S. Liu, Y. Peng, "A hybrid deep learning model for predicting molecular subtypes of human breast cancer using multimodal data," IRBM, vol. 43 no. 1, pp. 62-74, DOI: 10.1016/j.irbm.2020.12.002, 2022.
[43] S. Dwivedi, T. Goel, M. Tanveer, R. Murugan, R. Sharma, "Multi-modal fusion based deep learning network for effective diagnosis of Alzheimer’s disease," IEEE MultiMedia,DOI: 10.1109/mmul.2022.3156471, 2022.
[44] X. Pan, B. Burgman, N. Sahni, S. Yi, "Deep learning based on multi-omics integration identifies potential therapeutic targets in breast cancer," bioRxiv, 2022.
[45] F. Carrillo-Perez, J. C. Morales, D. Castillo-Secilla, A. Guillen, I. Rojas, L. J. Herrera, "Comparison of fusion methodologies using CNV and RNA-seq for cancer classification: a case study on non-small-cell lung cancer," Bioengineering and Biomedical Signal and Image Processing, vol. 12940, pp. 339-349, DOI: 10.1007/978-3-030-88163-4_29, 2021.
[46] M. O. Okwu, L. K. Tartibu, "Artificial neural network," Metaheuristic Optimization: Nature-Inspired Algorithms Swarm and Computational Intelligence, Theory and Applications, vol. 927, pp. 133-145, DOI: 10.1007/978-3-030-61111-8_14, 2021.
[47] S. Ioffe, C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," pp. 448-456, 2015. https://arxiv.org/abs/1502.03167
[48] I. Goodfellow, Y. Bengio, A. Courville, "Deep Learning," 2016. http://www.deeplearningbook.org
[49] Z. Hu, J. Zhang, Y. Ge, "Handling vanishing gradient problem using artificial derivative," IEEE Access, vol. 9, pp. 22371-22377, DOI: 10.1109/access.2021.3054915, 2021.
[50] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15 no. 1, pp. 1929-1958, 2014.
[51] F. Li, M. Liu, Y. Zhao, L. Kong, L. Dong, X. Liu, M. Hui, "Feature extraction and classification of heart sound using 1d convolutional neural networks," EURASIP Journal on Applied Signal Processing, vol. 59 no. 1, pp. 59-11, DOI: 10.1186/s13634-019-0651-3, 2019.
[52] H. Yang, C. Meng, C. Wang, "Data-driven feature extraction for analog circuit fault diagnosis using 1-d convolutional neural network," IEEE Access, vol. 8, pp. 18305-18315, DOI: 10.1109/access.2020.2968744, 2020.
[53] J. Zhao, F. Huang, J. Lv, Y. Duan, Z. Qin, G. Li, G. Tian, "Do RNN and LSTM have long memory?," pp. 11365-11375, 2020. https://arxiv.org/pdf/2006.03860.pdf
Copyright © 2022 Haleema Attique et al. This is an open access article distributed under the Creative Commons Attribution License (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. https://creativecommons.org/licenses/by/4.0/
Abstract
DNA copy number variation (CNV) is a type of DNA variation associated with various human diseases. CNVs range in size from one kilobase to several megabases on a chromosome. Most computational research on cancer classification is based on traditional machine learning, which relies on handcrafted feature extraction and selection. To the best of our knowledge, existing deep learning-based research also depends on separate feature extraction and selection steps. To distinguish between multiple human cancers, we developed three end-to-end deep learning models, i.e., a DNN (fully connected deep neural network), a CNN (convolutional neural network), and an RNN (recurrent neural network), to classify six cancer types using the CNV data of 24,174 genes. The strength of an end-to-end deep learning model lies in representation learning (automatic feature extraction). We propose more than one model in order to determine which architecture performs best on CNV data. Our best model achieved 92% accuracy with an ROC-AUC of 0.99, and we compared the performance of our proposed models with state-of-the-art techniques, which they outperformed in terms of accuracy, precision, and ROC. In the future, we aim to extend this work to other cancer types.
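To make the "end-to-end" idea concrete, the following is a minimal, hypothetical sketch of the fully connected (DNN) variant: per-gene CNV values are fed directly into the network, with no separate feature extraction or selection step. The layer sizes, ReLU activation, softmax output, and reduced input width (100 genes instead of the paper's 24,174) are illustrative assumptions, not the authors' exact architecture.

```python
import math
import random

def relu(v):
    # Elementwise rectified linear activation.
    return [max(0.0, x) for x in v]

def softmax(v):
    # Numerically stable softmax: outputs a probability over cancer types.
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def dense(x, w, b):
    # Fully connected layer: w has one row of weights per output unit.
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def init_layer(n_in, n_out, rng):
    # Small random weights, zero biases (illustrative initialization).
    scale = 1.0 / math.sqrt(n_in)
    w = [[rng.uniform(-scale, scale) for _ in range(n_in)]
         for _ in range(n_out)]
    return w, [0.0] * n_out

rng = random.Random(0)
n_genes, n_hidden, n_classes = 100, 16, 6  # 24,174 genes in the paper; 100 here for brevity
w1, b1 = init_layer(n_genes, n_hidden, rng)
w2, b2 = init_layer(n_hidden, n_classes, rng)

# Dummy CNV profile for one sample (real inputs would be per-gene copy
# number values); the forward pass maps it to six class probabilities.
sample = [rng.uniform(-2.0, 2.0) for _ in range(n_genes)]
probs = softmax(dense(relu(dense(sample, w1, b1)), w2, b2))
print(len(probs), round(sum(probs), 6))
```

In a trained model these weights would be fit by backpropagation on labeled CNV profiles; the sketch only shows the inference path that replaces handcrafted feature pipelines with learned representations.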
Details
1 Department of Computer Science, COMSATS University Islamabad, Abbottabad Campus, Islamabad, Pakistan
2 Department of Computer Science, COMSATS University Islamabad, Abbottabad Campus, Islamabad, Pakistan; EIAS Data Science Lab, College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia
3 Department of IT and Computer Science, Pak-Austria Fachhochschule: Institute of Applied Sciences and Technology, Mang, Haripur, KPK, Pakistan
4 EIAS Data Science Lab, College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia