1. Introduction
In scRNA-seq data, the number of genes is often very large and may reach tens of thousands. Some genes are irrelevant or unsuitable for classification tasks, and they may seriously degrade the efficiency of downstream data analysis. If all genes are utilized in classification, both the classification accuracy and the classification efficiency may be low. To eliminate these irrelevant genes and select high-quality genes, an effective and efficient gene selection algorithm is vital.
Feature selection (FS) problems can be treated as large-scale global optimization problems [1]; therefore, bioinspired intelligence optimization algorithms can be used to address them. Wang et al. [1] regarded FS problems as large-scale global optimization problems. Nakisa et al. [2] utilized evolutionary computation (EC) to search for the optimal feature subset. Eroglu and Kilic [3] integrated a genetic local search algorithm and a k-nearest neighbor classifier to select feature subsets. Maleki and Zeinali [4] used a hybrid genetic algorithm (GA) for dimension reduction and applied it to the classification of lung cancer. Tahir et al. [5] presented a binary chaotic GA to select features for healthcare datasets. However, how to correctly use the GA for gene selection and classification of scRNA-seq data is a significant open issue. To the best of our knowledge, there are only a few studies on this topic so far.
This study integrates the information gain ratio (IGR) and a genetic algorithm with dynamic crossover (DCGA) to address the gene selection and classification of scRNA-seq data and proposes a novel gene selection and classification algorithm, IGRDCGA. The IGRDCGA utilizes the IGR to eliminate irrelevant genes roughly and obtain a preliminary gene subset and then employs the DCGA to choose high-quality genes finely from the preliminary subset.
The rest of this study is organized as follows. Section 2 briefly describes the information gain ratio and the genetic algorithm. Section 3 states three evaluation metrics. The datasets and the preprocessing method used in the study are described in Section 4. The coding and the other details of the IGRDCGA are described in Section 5. The numerical results of the IGRDCGA and several competing algorithms are given in Section 6. Section 7 concludes the study.
2. Related Work
In this section, the information gain ratio and genetic algorithm are described as follows.
2.1. Information Gain Ratio
The information gain is a metric derived from information entropy and is often used to evaluate the degree of mutual dependence between two random variables. Namely, it is a symmetric metric of dependency [6].
For two discrete random variables X and Y, their information entropies can be calculated, respectively, in terms of the following formulas:

H(X) = -Σ_x p(x) log2 p(x), (1)

H(Y) = -Σ_y p(y) log2 p(y). (2)

The conditional entropy and information gain [6–8] of X versus Y can be calculated in terms of the following two formulas, respectively:

H(X|Y) = -Σ_y p(y) Σ_x p(x|y) log2 p(x|y), (3)

IG(X; Y) = H(X) - H(X|Y). (4)

The information gain ratio of X versus Y is the ratio of the information gain to the information entropy, which is formulated as

IGR(X; Y) = IG(X; Y) / H(Y). (5)
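As a concrete illustration, the entropy, conditional entropy, information gain, and information gain ratio can be computed for discrete label vectors as in the following minimal Python sketch. Normalizing the gain by the entropy of the conditioning variable is one common convention and is an assumption here, as is the use of Python rather than the authors' environment:

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Information entropy H(X) of a discrete variable, in bits."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    """Conditional entropy H(X|Y): weighted entropy of X within each group of Y."""
    n = len(ys)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def info_gain(xs, ys):
    """Information gain IG(X; Y) = H(X) - H(X|Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)

def info_gain_ratio(xs, ys):
    """Information gain ratio: the gain normalized by an entropy (assumed H(Y))."""
    h = entropy(ys)
    return info_gain(xs, ys) / h if h > 0 else 0.0
```

In a gene selection setting, `xs` would hold the class labels and `ys` a discretized expression profile of one gene; an IGR of 0 then flags the gene as irrelevant.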
2.2. Genetic Algorithm
The genetic algorithm (GA) is a bioinspired intelligence optimization algorithm. It is inspired by the process of natural selection and belongs to the family of evolutionary algorithms (EAs) [9–13]. It is commonly utilized to generate feasible solutions for optimization problems by performing operators such as selection, crossover, and mutation [14–16]. The selection operator chooses a subset of chromosomes from the previous population for the crossover operator; a frequently used selection operator is a random selection strategy such as tournament selection. The crossover operator exchanges one or more genes between the two parents chosen by the selection operator, simulating reproduction or recombination in the biological evolution process. The GA decides whether to perform the crossover operator according to the crossover probability. Through the crossover operator, two parents may generate two or more offspring, depending on the crossover strategy; frequently used crossover operators include single-point, two-point, and multipoint crossover. The mutation operator modifies one or more genes in a chromosome, simulating gene mutation in the biological evolution process. Similarly, the GA decides whether to perform the mutation operator according to the mutation probability. Frequently used mutation operators include locus mutation, exchange mutation, and insertion mutation. The mutation operator acts on a single chromosome, while the crossover operator acts on two chromosomes. Generally, the crossover probability is much larger than the mutation probability, which accords with the biological evolution process.
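As an illustration of the tournament selection strategy mentioned above, a minimal Python sketch follows; the tournament size `k` and the convention that larger fitness wins are assumptions, not details from the paper:

```python
import random

def tournament_select(population, fitness, k=2):
    """Pick the fittest of k randomly drawn chromosomes (larger fitness wins)."""
    contestants = random.sample(range(len(population)), k)
    best = max(contestants, key=lambda i: fitness[i])
    return population[best]
```

With `k` equal to the population size, the selection becomes deterministic; smaller `k` keeps more randomness and diversity.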
3. Evaluation Metrics
To evaluate the performance of the IGRDCGA, in the study, we utilize the following evaluation metrics: NMI (normalized mutual information) [17–19], ARI (adjusted Rand index) [20, 21], and purity [22].
3.1. NMI (Normalized Mutual Information)
The NMI is a frequently used evaluation metric. It is often used to evaluate the agreement between the obtained clustering results and the ground truth results.
For two discrete random variables X and Y, their MI (mutual information) can be calculated in terms of the following formula:

MI(X; Y) = Σ_x Σ_y p(x, y) log2 [p(x, y) / (p(x) p(y))]. (6)

The normalized MI is taken as the NMI, which can be calculated in terms of the following formula (normalizing by the geometric mean of the two entropies):

NMI(X; Y) = MI(X; Y) / sqrt(H(X) H(Y)). (7)
Example 1.
Suppose that the ground truth class labels of a single-cell dataset are as follows: y1 = [1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3] and the classification result of an algorithm is as follows: y2 = [1 2 1 1 1 1 1 2 2 2 2 3 1 1 3 3 3]. It follows from y1 and y2 that the unique values of both are [1 2 3]. Then, the probability values in formula (6) are calculated as follows:
According to formula (6), it follows that
According to formula (1) or (2), it follows that
According to formula (7), it follows that
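The NMI of Example 1 can be checked numerically. The following minimal Python sketch computes MI and NMI for y1 and y2; the normalization by the geometric mean of the two entropies is one common convention and is an assumption here:

```python
from collections import Counter
from math import log2, sqrt

def entropy(xs):
    """Information entropy of a discrete label vector, in bits."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """Mutual information of two label vectors via their joint distribution."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

def nmi(xs, ys):
    """NMI: MI normalized by the geometric mean of the two entropies."""
    return mutual_information(xs, ys) / sqrt(entropy(xs) * entropy(ys))

y1 = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
y2 = [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 3, 3, 3]
score = nmi(y1, y2)
```

A perfect match (or a pure relabeling) yields NMI = 1, while independent labelings yield values near 0.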
3.2. ARI (Adjusted Rand Index)
The ARI is another widely used evaluation metric for measuring the concordance between two clustering results. The RI (Rand index) measures the similarity between two results: it considers all pairs of samples and counts the pairs that are placed in the same cluster, or in different clusters, in both the predicted and the ground truth clusterings.
For two clustering results X and Y containing n elements, let a denote the number of element pairs placed in the same cluster in both X and Y, and b the number of pairs placed in different clusters in both. The RI of X and Y is calculated in terms of the following formula:

RI = (a + b) / C(n, 2). (8)
The overlap between X and Y can be summarized in a contingency table (n_ij), in which each element n_ij represents the number of objects in common between cluster i of X and cluster j of Y, with row sums a_i and column sums b_j. The ARI is the adjusted RI, which can be calculated in terms of the following formula:

ARI = [Σ_ij C(n_ij, 2) - Σ_i C(a_i, 2) Σ_j C(b_j, 2) / C(n, 2)] / [(Σ_i C(a_i, 2) + Σ_j C(b_j, 2)) / 2 - Σ_i C(a_i, 2) Σ_j C(b_j, 2) / C(n, 2)]. (9)
The ARI is the corrected-for-chance version of the RI. The RI may only yield a value between 0 and 1; the ARI can yield negative values if the index is less than the expected index. The optimal score of the ARI is 1, which represents that two clustering results are identical. A larger ARI signifies higher concordance between X and Y.
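A minimal Python sketch of the ARI computed from the contingency table of two label vectors follows; it implements the standard chance-corrected formula and is not any implementation from the cited works:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(xs, ys):
    """ARI of two label vectors via their contingency table."""
    n = len(xs)
    contingency = Counter(zip(xs, ys))          # n_ij counts
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(xs).values())  # row sums a_i
    sum_b = sum(comb(c, 2) for c in Counter(ys).values())  # column sums b_j
    expected = sum_a * sum_b / comb(n, 2)       # expected index under chance
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Note that the ARI is invariant under relabeling of clusters, and random assignments score near 0, possibly negative.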
3.3. Purity
The purity [22] is another simple and transparent evaluation metric for evaluating clustering performance. For purity, each identified cluster is assigned the label that is most frequent in the cluster, and the accuracy of this assignment is then computed by counting the number of correctly assigned cells and dividing by the total number of cells N. This mapping is not one-to-one and may be biased toward the class with the largest size. Nonetheless, it provides a simple metric.
For a clustering result X = {x_1, ..., x_K} and the ground truth partition Y = {y_1, ..., y_J} containing n elements, the purity of X with respect to Y is calculated in terms of the following formula:

purity(X, Y) = (1/n) Σ_k max_j |x_k ∩ y_j|. (10)
The purity ranges from 0 to 1, where 1 is the optimal score, denoting that X and Y induce identical cluster assignments. A larger purity signifies higher concordance between X and Y.
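A minimal Python sketch of the purity computation described above (majority label per predicted cluster, then overall accuracy):

```python
from collections import Counter

def purity(predicted, truth):
    """Assign each predicted cluster its majority true label; return accuracy."""
    n = len(predicted)
    clusters = {}
    for p, t in zip(predicted, truth):
        clusters.setdefault(p, []).append(t)
    # Count, per cluster, how many members carry the cluster's majority label.
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in clusters.values())
    return correct / n
```

Because the cluster-to-label mapping is not one-to-one, a degenerate clustering that puts every cell in its own cluster trivially attains purity 1, so purity is best read alongside NMI and ARI.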
4. Dataset and Preprocessing
4.1. Dataset
In the study, 33 publicly available scRNA-seq datasets are utilized to verify the performance of the IGRDCGA, as shown in Table 1. The datasets contain various single-cell gene expression data published in many different studies. Each row in a dataset denotes an observation or cell, while each column denotes a feature or gene. Table 1 shows the features of the datasets, such as GSE, name, the number of cells (#cell), the number of genes (#gene), the number of ground truth classes (#class), and references.
Table 1
ScRNA-seq datasets and their features.
GSE/ID | Name | # Cell | # Gene | #Class | References |
GSE74543 | Allodiploid | 16 | 2406 | 2 | Li et al., 2016 [23] |
GSE42268 | Sasagawa | 23 | 32700 | 3 | Sasagawa et al., 2013 [24] |
GSE38495 | Ramskold | 33 | 21042 | 7 | Ramskold et al., 2012 [25] |
GSE57249 | Biase | 56 | 25734 | 4 | Biase et al., 2014 [26] |
GSE51372 | Ting | 187 | 21583 | 7 | Ting et al., 2014 [27] |
GSE75688 | Chung | 518 | 41821 | 4 | Chung et al., 2017 [28] |
GSE85908 | Yeo | 214 | 27473 | 4 | Yeo et al., 2017 [29] |
GSE64016 | Ning | 247 | 19080 | 3 | Ling et al., 2015 [30] |
GSE60783 | Ginhoux | 251 | 11834 | 3 | Schlitzer et al., 2015 [31] |
GSE87795 | Su | 367 | 15333 | 6 | Su et al., 2017 [32] |
GSE59739 | Usoskin | 622 | 25334 | 4 | Usoskin et al., 2015 [33] |
GSE45719 | Deng | 268 | 22431 | 6 | Deng et al., 2014 [34] |
GSE53386 | Fan | 66 | 26357 | 6 | Fan et al., 2015 [35] |
E-MTAB-3321 | Goolam | 124 | 40315 | 5 | Goolam et al., 2016 [36] |
E-MTAB-2600 | Kolodz | 704 | 13473 | 3 | Kolodziejczyk et al., 2015 [37] |
GSE81682 | Nestorowa | 1656 | 4773 | 9 | Nestorowa et al., 2016 [38] |
GSE52583 | Treutlein | 80 | 23271 | 5 | Treutlein et al., 2014 [39] |
GSE75140 | Camp15 | 734 | 18927 | 6 | Camp et al., 2015 [40] |
GSE81252 | Camp17 | 777 | 19020 | 7 | Camp et al., 2017 [41] |
GSE36552 | Yan | 90 | 20214 | 6 | Yan et al., 2013 [42] |
GSE83139 | Wang | 635 | 19950 | 8 | Wang et al., 2016 [43] |
GSE81861 | Li1 | 561 | 55186 | 9 | Li et al., 2017 [44] |
GSE73727 | Li2 | 60 | 180253 | 6 | Li et al., 2016 [45] |
GSE57872 | Patel | 430 | 5948 | 5 | Patel et al., 2014 [46] |
SRP041736 | Pollen | 301 | 23730 | 11 | Alex et al., 2014 [47] |
GSE76381 | Manno_m | 2150 | 24378 | 32 | La Manno et al., 2016 [48] |
GSE76381 | Manno_h | 4029 | 20560 | 56 | La Manno et al., 2016 [48] |
GSE71585 | Tasic | 1679 | 24057 | 18 | Tasic et al., 2016 [49] |
GSE60361 | Zeisel | 3005 | 19486 | 9 | Zeisel et al., 2015 [50] |
GSE76983 | Grun | 1502 | 23536 | 2 | Grun et al., 2016 [51] |
GSE84133 | Baron | 1886 | 14878 | 13 | Baron et al., 2016 [52] |
GSE85241 | Muraro | 2126 | 19127 | 10 | Muraro et al., 2016 [53] |
GSE81608 | Xin | 1600 | 39851 | 8 | Xin et al., 2016 [54] |
4.2. Data Preprocessing
For a scRNA-seq dataset containing N cells and M genes, let its gene expression matrix be D = (d_ij), with N rows (cells) and M columns (genes). To lower the impact of genes with large expression levels on genes with small expression levels, all the data in D are normalized column by column with min-max normalization:

d'_ij = (d_ij - min_i d_ij) / (max_i d_ij - min_i d_ij).
It is obvious that all the elements are normalized according to each column (each column denotes a gene). Therefore, after preprocessing, the gene expression level of each gene ranges from 0 to 1, and all the elements in D range from 0 to 1 as well.
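The column-wise (per-gene) min-max normalization described above can be sketched as follows; the handling of constant columns (clamping their range to 1 so they normalize to all zeros) is an assumption, since the text does not specify it:

```python
import numpy as np

def minmax_normalize(data):
    """Column-wise (per-gene) min-max normalization to [0, 1].

    Constant columns would divide by zero, so their range is clamped
    to 1 and they normalize to all zeros (an assumed convention)."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0
    return (data - col_min) / col_range
```

After this step, every gene's expression values, and hence every element of the matrix, lie in [0, 1], as stated above.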
5. The Proposed Algorithm
In this study, we present a novel algorithm to address the gene selection and classification for scRNA-seq data by combining information gain ratio and genetic algorithm with dynamic crossover (IGRDCGA for short). The coding and the other details of the IGRDCGA are as follows.
5.1. Coding and Initialization
To choose high-quality genes from a huge number of genes, we need to design a good coding for each chromosome. For a scRNA-seq dataset containing N cells and M genes, we design a variable-length coding. Numbering the M genes from 1 to M, the coding of a chromosome can be expressed as a sequence C = (g_1, g_2, ..., g_l), where each g_k is a gene number in [1, M] and the length l may differ between chromosomes.
Obviously, the coding of a chromosome signifies a combination of genes, namely, a result of gene selection. Therefore, different chromosomes can signify different gene selections, and variable-length coding can signify gene selections of variable size.
According to the above coding rules of a chromosome, Algorithm 1 presents the initialization algorithm to generate the initial population whose population size is Npop.
Algorithm 1: Initialization of population.
(1) Generate one random integer l in the interval [1, M].
(2) Generate l random integers in the interval [1, M].
(3) Repeat steps (1) and (2) Npop times to obtain the initial population.
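Algorithm 1 can be sketched in Python as follows; drawing the l gene numbers without replacement (so they are distinct within a chromosome) is an assumption, since Algorithm 1 does not state whether duplicates are allowed:

```python
import random

def init_population(n_pop, n_genes):
    """Algorithm 1 sketch: n_pop variable-length chromosomes of gene numbers in [1, M]."""
    population = []
    for _ in range(n_pop):
        # Step (1): random length l in [1, M].
        length = random.randint(1, n_genes)
        # Step (2): l gene numbers in [1, M]; sampling without replacement
        # keeps them distinct (assumed convention).
        population.append(random.sample(range(1, n_genes + 1), length))
    return population
```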
5.2. Crossover Operator
As the length of the coding of a chromosome is variable, we need to design a new crossover operator to fit variable-length coding. Given the codings of two parents C1 and C2, let their lengths be l1 and l2, respectively, and let l3 denote the minimum of l1 and l2. The number of genes exchanged by the crossover operator should be less than or equal to l3. This study presents a new crossover operator, as shown in Algorithm 2.
Algorithm 2: Crossover operator.
(1) Generate one random integer l4 that is less than or equal to l3.
(2) For C1, generate the location to exchange loc1, which contains l4 different random integers that are larger than or equal to 1 and less than or equal to l1.
(3) For C2, generate the location to exchange loc2, which contains l4 different random integers that are larger than or equal to 1 and less than or equal to l2.
(4) The coding bits at the locations loc1 in C1 and the coding bits at the locations loc2 in C2 are exchanged with each other.
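Algorithm 2 can be sketched as follows; pairing the i-th exchange location in C1 with the i-th exchange location in C2 is an assumption, as the text does not fix the pairing order:

```python
import random

def crossover(c1, c2):
    """Algorithm 2 sketch: exchange l4 gene positions between two parents."""
    c1, c2 = c1[:], c2[:]                 # work on copies of the parents
    l3 = min(len(c1), len(c2))
    l4 = random.randint(1, l3)            # step (1): number of positions to swap
    loc1 = random.sample(range(len(c1)), l4)  # step (2): positions in C1
    loc2 = random.sample(range(len(c2)), l4)  # step (3): positions in C2
    for i, j in zip(loc1, loc2):          # step (4): pairwise exchange
        c1[i], c2[j] = c2[j], c1[i]
    return c1, c2
```

Note that the exchange preserves each offspring's length, so variable-length chromosomes keep their sizes through crossover.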
5.3. Mutation Operator
Similarly, we also need to design a new mutation operator to fit variable-length coding. Given the coding of one parent C3, let its length be l5. The number of genes mutated by the mutation operator should be less than or equal to l5. This study presents a new mutation operator, as shown in Algorithm 3.
Algorithm 3: Mutation operator.
(1) Generate one random integer l6 that is less than or equal to l5.
(2) For C3, generate the location to mutate loc3, which contains l6 different random integers that are larger than or equal to 1 and less than or equal to l5.
(3) The coding bits at the locations loc3 in C3 are replaced with other integers in [1, M] that are not equal to any of the existing coding bits of C3.
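Algorithm 3 can be sketched as follows; when fewer than l6 unused gene numbers exist, some selected positions are left unchanged, which is an assumed convention not specified in the text:

```python
import random

def mutate(c3, n_genes):
    """Algorithm 3 sketch: replace l6 positions with gene numbers not already in C3."""
    c3 = c3[:]                                   # work on a copy
    l6 = random.randint(1, len(c3))              # step (1): number of positions
    loc3 = random.sample(range(len(c3)), l6)     # step (2): positions to mutate
    # Step (3): candidate replacements are gene numbers absent from C3.
    unused = [g for g in range(1, n_genes + 1) if g not in set(c3)]
    for i, g in zip(loc3, random.sample(unused, min(l6, len(unused)))):
        c3[i] = g
    return c3
```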
5.4. Detailed Steps of IGRDCGA
To illustrate the main process of the proposed algorithm, Figure 1 shows the flow chart of the IGRDCGA.
[figure omitted; refer to PDF]
The detailed steps of the IGRDCGA are summarized as follows.
Step (1). Compute information gain ratio IGR of each gene in terms of formula (5) first. Then, eliminate those genes whose IGR is 0.
Step (2). To eliminate irrelevant genes and choose high quality genes, the threshold method is used to choose those genes whose IGR is the top best
Step (3). Implement Algorithm 1 described in Section 5.1 to obtain the initial population P1.
Step (4). Compute the fitness and obtain the best chromosome.
(1) For each chromosome pi in P1, select those genes determined by pi to obtain a new data
(2) Obtain the best chromosome
Step (5). Let the number of iterations
Step (6). Tournament selection strategy [55, 56] is used to choose
Step (7). Generate a random r. If r is less than or equal to the crossover probability
Step (8). Generate a random r. If r is less than or equal to the mutation probability
Step (9). Compute the fitness value of each chromosome in P4 and obtain the best chromosome
Step (10). Turn to Step 5.
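Putting Steps 3–10 together, the overall evolutionary loop can be outlined as below. This is a schematic sketch, not the authors' implementation: the fitness function (in the paper, a clustering-quality score on the data restricted to the selected genes), the population size, the iteration limit, and the probabilities p_c and p_m are all assumed placeholders:

```python
import random

def run_ga(fitness, n_genes, n_pop=20, t_max=50, p_c=0.9, p_m=0.1):
    """Schematic main loop of Steps 3-10: evolve variable-length gene subsets."""
    # Step 3: initial population of variable-length chromosomes.
    pop = [random.sample(range(1, n_genes + 1), random.randint(1, n_genes))
           for _ in range(n_pop)]
    # Step 4: best chromosome of the initial population.
    best = max(pop, key=fitness)
    for _ in range(t_max):                       # Step 5: iteration counter.
        new_pop = []
        while len(new_pop) < n_pop:
            # Step 6: tournament selection of two parents.
            p1 = max(random.sample(pop, 2), key=fitness)
            p2 = max(random.sample(pop, 2), key=fitness)
            c1, c2 = p1[:], p2[:]
            # Step 7: crossover with probability p_c (exchange l4 positions).
            if random.random() <= p_c:
                l4 = random.randint(1, min(len(c1), len(c2)))
                for i, j in zip(random.sample(range(len(c1)), l4),
                                random.sample(range(len(c2)), l4)):
                    c1[i], c2[j] = c2[j], c1[i]
            # Step 8: mutation with probability p_m (replace one position).
            for c in (c1, c2):
                if random.random() <= p_m:
                    unused = set(range(1, n_genes + 1)) - set(c)
                    if unused:
                        c[random.randrange(len(c))] = random.choice(sorted(unused))
            new_pop.extend([c1, c2])
        pop = new_pop[:n_pop]
        # Step 9: keep track of the best chromosome found so far.
        best = max(pop + [best], key=fitness)
    return best
```

A toy fitness such as `lambda c: -abs(len(c) - 3)` can be used to exercise the loop; the real objective would score the clustering of the cells on the selected genes.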
5.5. Time Complexity Analysis of IGRDCGA
Suppose that a scRNA-seq dataset contains N cells and M genes, and the population size in GA is Npop. The time complexity of the IGRDCGA is analyzed as follows.
The time complexity of Step 1 is O (M), the time complexity of Step 2 is O (1), and the time complexity of Algorithm 1 in Step 3 is O (Npop). In Step 4, the time complexity of the k means clustering is O (N), and that of obtaining the best chromosome is also O (Npop). Step 5 to Step 10 are the main iteration process of the IGRDCGA; their time complexities determine the time complexity of the IGRDCGA. The execution count of Steps 5 and 6 is both tmax (the maximal number of iterations), and the time complexities are both O (tmax). In Step 7, the time complexity of selecting two chromosomes is O (tmax), while the counts of executions of the crossover operator are l3 (the minimum value of the lengths of two chromosome codes), and its time complexity is still O (l3
To sum up, the time complexity of the IGRDCGA involves the following time complexities: O (1), O (M), O (Npop), O (tmax), O (l3
6. Numerical Results
In order to evaluate the performance of the IGRDCGA, two frequently used clustering algorithms, k-means and spectral clustering [57], and a state-of-the-art single-cell classification algorithm, SIMLR [58], are employed for comparison. To assess the gene selection performance of the IGRDCGA, two frequently used dimensionality reduction algorithms, PCA and tSNE, are also included in the comparison. For the SIMLR, we use its MATLAB program, available at https://github.com/BatzoglouLabSU/SIMLR. For k-means, PCA, and tSNE, the study utilizes the built-in functions in MATLAB.
6.1. Parameter Values
The related parameters of the IGRDCGA are described as follows.
(i) Parameter for IGR. The threshold
(ii) Terminal Condition for the IGRDCGA. The population size
The other competing algorithms use their default parameter values.
6.2. Results
6.2.1. Comparisons of the IGR
The IGR of each gene shows its relevance. The IGR ranges from 0 to 1, where 0 represents that the gene has no relevance for the classification. Therefore, the IGR can be employed to eliminate irrelevant genes roughly. We compute the IGR of each gene for all scRNA-seq datasets and obtain the number of irrelevant genes (#irrelevant genes) and their percentages with respect to the total genes. The obtained results are shown in Table 2.
Table 2
Irrelevant genes and their percentage.
Dataset | #Of genes | #Irrelevant genes | Percentage (%) |
Allodiploid | 2406 | 0 | 0 |
Sasagawa | 32700 | 15412 | 47.13 |
Ramskold | 21042 | 1005 | 4.78 |
Biase | 25734 | 51 | 0.2 |
Ting | 21583 | 4349 | 20.15 |
Chung | 41821 | 14217 | 33.99 |
Yeo | 27473 | 4042 | 14.71 |
Ning | 19080 | 1962 | 10.28 |
Ginhoux | 11834 | 0 | 0 |
Su | 15333 | 0 | 0 |
Usoskin | 25334 | 5800 | 22.89 |
Deng | 22431 | 1257 | 5.6 |
Fan | 26357 | 941 | 3.57 |
Goolam | 40315 | 12592 | 31.23 |
Kolodz | 13473 | 0 | 0 |
Nestorowa | 4773 | 0 | 0 |
Treutlein | 23271 | 10497 | 45.11 |
Camp15 | 18927 | 2135 | 11.28 |
Camp17 | 19020 | 1285 | 6.76 |
Yan | 20214 | 619 | 3.06 |
Wang | 19950 | 0 | 0 |
Li1 | 55186 | 12131 | 21.98 |
Li2 | 180253 | 40335 | 22.38 |
Patel | 5948 | 0 | 0 |
Pollen | 23730 | 1920 | 8.09 |
Manno_m | 24378 | 2390 | 9.8 |
Manno_h | 20560 | 0 | 0 |
Tasic | 24057 | 2443 | 10.16 |
Zeisel | 19486 | 0 | 0 |
Grun | 23536 | 7928 | 33.68 |
Baron | 14878 | 17 | 0.11 |
Muraro | 19127 | 225 | 1.18 |
Xin | 39851 | 5962 | 14.96 |
From Table 2, we can clearly see that irrelevant genes exist in most datasets. The percentage of irrelevant genes with respect to the total genes indicates the gene redundancy rate. There are 24 datasets in Table 2 whose gene redundancy rate is larger than 0, which account for 72.72% of the total 33 datasets. This demonstrates that the criterion IGR = 0 is a good way to eliminate irrelevant genes.
However, Table 2 also shows that there are 9 datasets whose redundancy rate is 0. Namely, the criterion IGR = 0 cannot find any irrelevant genes in these 9 datasets. This also demonstrates that IGR = 0 can identify only some of the irrelevant genes and cannot determine all of them. Therefore, our proposed algorithm IGRDCGA utilizes a threshold method to eliminate irrelevant genes.
Comparatively speaking, datasets containing more genes tend to possess higher gene redundancy rates, and vice versa. Nevertheless, this is not absolute, as can be seen in Table 2.
6.2.2. Comparisons of Evaluation Metrics
We perform the IGRDCGA and several competing algorithms on the above scRNA-seq datasets described in Section 4.1. Three evaluation metrics NMI, ARI, and purity are employed to evaluate the performances of the IGRDCGA and the other four competing algorithms. All the algorithms are independently performed for 20 runs to obtain their average values. For all 33 datasets, the average values of NMI, ARI, and purity metrics are shown in Tables 3–5, respectively.
Table 3
NMI metric obtained by the IGRDCGA and several competing algorithms.
Dataset | K means | Spectral | SIMLR | IGRDCGA | PCA | tSNE |
Allodiploid | 0.3793 | 0.0778 | 1.0000 | 1.0000 | 1.0000 | 0.0490 |
Sasagawa | 0.3048 | 0.1815 | 0.4665 | 0.9128 | 0.6349 | 0.0903 |
Ramskold | 0.6868 | 0.4946 | 0.7594 | 0.9709 | 0.9394 | 0.3444 |
Biase | 0.6690 | 0.5114 | 0.8321 | 1.0000 | 0.9501 | 0.0544 |
Ting | 0.2161 | 0.1211 | 0.5082 | 0.5778 | 0.5746 | 0.0585 |
Chung | 0.1280 | 0.0295 | 0.3170 | 0.4196 | 0.2278 | 0.3336 |
Yeo | 0.2358 | 0.0480 | 0.7744 | 0.8168 | 0.7650 | 0.6831 |
Ning | 0.1100 | 0.0388 | 0.3465 | 0.7432 | 0.3579 | 0.0073 |
Ginhoux | 0.4094 | 0.0326 | 0.4189 | 0.5580 | 0.4777 | 0.0074 |
Su | 0.3145 | 0.0676 | 0.4963 | 0.5951 | 0.3910 | 0.0201 |
Usoskin | 0.1823 | 0.0363 | 0.4552 | 0.8334 | 0.6235 | 0.6579 |
Deng | 0.7451 | 0.3860 | 0.6979 | 0.8977 | 0.7671 | 0.0334 |
Fan | 0.2939 | 0.1910 | 0.4925 | 0.5874 | 0.5249 | 0.1417 |
Goolam | 0.2826 | 0.1917 | 0.4464 | 0.7428 | 0.3282 | 0.0527 |
Kolodz | 0.7024 | 0.0235 | 0.9634 | 1.0000 | 0.9790 | 0.0035 |
Nestorowa | 0.3499 | 0.1983 | 0.3514 | 0.3774 | 0.3562 | 0.3795 |
Treutlein | 0.2520 | 0.2585 | 0.4574 | 0.8074 | 0.5445 | 0.0754 |
Camp15 | 0.4025 | 0.0596 | 0.5098 | 0.4945 | 0.4257 | 0.4979 |
Camp17 | 0.6479 | 0.0735 | 0.7420 | 0.7344 | 0.7366 | 0.0117 |
Yan | 0.7034 | 0.4988 | 0.7937 | 0.9480 | 0.8830 | 0.0872 |
Wang | 0.0609 | 0.0613 | 0.2588 | 0.3038 | 0.2226 | 0.3274 |
Li1 | 0.4852 | 0.0634 | 0.7377 | 0.7623 | 0.7168 | 0.0275 |
Li2 | 0.2117 | 0.1807 | 0.3387 | 0.5915 | 0.3452 | 0.1425 |
Patel | 0.1669 | 0.0429 | 0.6446 | 0.7436 | 0.7349 | 0.8248 |
Pollen | 0.5846 | 0.1790 | 0.9272 | 0.9008 | 0.9055 | 0.8886 |
Manno_m | 0.3084 | 0.0887 | 0.4026 | 0.3979 | 0.4065 | 0.4710 |
Manno_h | 0.3796 | 0.6052 | 0.4867 | 0.4849 | 0.5287 | 0.5927 |
Tasic | 0.6758 | 0.7707 | 0.7456 | 0.8458 | 0.8039 | 0.8180 |
Zeisel | 0.4681 | 0.6586 | 0.7097 | 0.6076 | 0.5229 | 0.7391 |
Grun | 0.0059 | 0.1145 | 0.4133 | 0.3729 | 0.0055 | 0.0005 |
Baron | 0.2079 | 0.5731 | 0.5950 | 0.5480 | 0.4281 | 0.6618 |
Muraro | 0.2930 | 0.6647 | 0.7001 | 0.6897 | 0.4189 | 0.0095 |
Xin | 0.2587 | 0.4237 | 0.2422 | 0.7201 | 0.5445 | 0.0100 |
Table 4
ARI metric obtained by the IGRDCGA and several competing algorithms.
Dataset | K means | Spectral | SIMLR | IGRDCGA | PCA | tSNE |
Allodiploid | 0.2767 | −0.0500 | 1.0000 | 1.0000 | 1.0000 | −0.0125 |
Sasagawa | 0.1322 | 0.0262 | 0.4306 | 0.9023 | 0.5238 | −0.0037 |
Ramskold | 0.3999 | 0.1379 | 0.5358 | 0.9481 | 0.8590 | 0.0024 |
Biase | 0.5343 | 0.2827 | 0.7346 | 1.0000 | 0.9556 | −0.0118 |
Ting | 0.0985 | 0.0196 | 0.3019 | 0.3905 | 0.3044 | 0.0002 |
Chung | 0.0420 | −0.0021 | 0.2011 | 0.2059 | 0.1222 | 0.2216 |
Yeo | 0.1408 | −0.0068 | 0.7760 | 0.7964 | 0.7675 | 0.5966 |
Ning | 0.0608 | 0.0007 | 0.3269 | 0.7456 | 0.3137 | −0.0004 |
Ginhoux | 0.3753 | −0.0016 | 0.3379 | 0.6062 | 0.4790 | −0.0002 |
Su | 0.3035 | 0.0084 | 0.3329 | 0.5590 | 0.3180 | 0.0000 |
Usoskin | 0.1033 | 0.0020 | 0.3007 | 0.8343 | 0.5319 | 0.5538 |
Deng | 0.7033 | 0.1435 | 0.5059 | 0.8786 | 0.5460 | 0.0011 |
Fan | 0.1156 | 0.0386 | 0.3323 | 0.4049 | 0.3529 | 0.0060 |
Goolam | 0.1436 | 0.0656 | 0.2284 | 0.7564 | 0.1433 | 0.0005 |
Kolodz | 0.6468 | 0.0010 | 0.9762 | 1.0000 | 0.9880 | 0.0011 |
Nestorowa | 0.2236 | 0.0735 | 0.2166 | 0.2572 | 0.2284 | 0.2584 |
Treutlein | 0.1204 | 0.0741 | 0.2615 | 0.8138 | 0.4066 | −0.0038 |
Camp15 | 0.3614 | 0.0141 | 0.4402 | 0.4835 | 0.3537 | 0.4121 |
Camp17 | 0.4667 | 0.0031 | 0.4859 | 0.6025 | 0.5641 | −0.0005 |
Yan | 0.5801 | 0.2894 | 0.6304 | 0.9425 | 0.8453 | −0.0045 |
Wang | 0.0084 | 0.0025 | 0.1445 | 0.1239 | 0.1111 | 0.1926 |
Li1 | 0.2265 | 0.0002 | 0.6032 | 0.6023 | 0.5354 | −0.0002 |
Li2 | 0.0280 | 0.0111 | 0.1496 | 0.3423 | 0.1793 | 0.0014 |
Patel | 0.0581 | −0.0012 | 0.6065 | 0.7383 | 0.7424 | 0.8437 |
Pollen | 0.2423 | 0.0177 | 0.8436 | 0.8258 | 0.8326 | 0.7889 |
Manno_m | 0.0721 | 0.0024 | 0.1317 | 0.1351 | 0.1328 | 0.1822 |
Manno_h | 0.0774 | 0.3050 | 0.1489 | 0.1593 | 0.2006 | 0.2767 |
Tasic | 0.4326 | 0.6113 | 0.6176 | 0.8090 | 0.6889 | 0.6716 |
Zeisel | 0.3023 | 0.4970 | 0.5730 | 0.4676 | 0.3418 | 0.5866 |
Grun | −0.0300 | 0.0730 | 0.4375 | 0.5053 | −0.0355 | −0.0001 |
Baron | 0.0516 | 0.2526 | 0.2887 | 0.3202 | 0.1450 | 0.3251 |
Muraro | 0.1306 | 0.4838 | 0.5716 | 0.6391 | 0.2261 | −0.0002 |
Xin | 0.2802 | 0.2389 | 0.0771 | 0.8145 | 0.3374 | 0.0000 |
Table 5
Purity metric obtained by the IGRDCGA and several competing algorithms.
Dataset | K means | spectral | SIMLR | IGRDCGA | PCA | tSNE |
Allodiploid | 0.2767 | −0.0500 | 1.0000 | 1.0000 | 1.0000 | −0.0125 |
Sasagawa | 0.0387 | −0.2253 | 0.5020 | 0.9170 | 0.5731 | 0.1462 |
Ramskold | 0.5953 | 0.1801 | 0.7386 | 0.9756 | 0.9331 | 0.5491 |
Biase | 0.5124 | 0.1796 | 0.7896 | 1.0000 | 0.9636 | 0.2149 |
Ting | −0.1852 | −0.4401 | 0.5643 | 0.4239 | 0.5204 | 0.3951 |
Chung | −0.1686 | −0.4338 | 0.3331 | 0.1288 | 0.2860 | 0.3774 |
Yeo | −0.0742 | −0.4393 | 0.8247 | 0.8352 | 0.8196 | 0.6856 |
Ning | −0.1955 | −0.3228 | 0.3667 | 0.7629 | 0.3466 | 0.1128 |
Ginhoux | 0.3983 | −0.3019 | 0.3912 | 0.6425 | 0.5308 | 0.1045 |
Su | 0.4949 | −0.4755 | 0.5004 | 0.7007 | 0.5594 | 0.3739 |
Usoskin | −0.1365 | −0.4286 | 0.3846 | 0.8584 | 0.6343 | 0.6506 |
Deng | 0.7271 | −0.0730 | 0.6137 | 0.8920 | 0.6378 | 0.2573 |
Fan | 0.1441 | −0.2779 | 0.5980 | 0.6421 | 0.6073 | 0.4231 |
Goolam | 0.1340 | −0.1726 | 0.3632 | 0.7725 | 0.2284 | 0.1841 |
Kolodz | 0.5854 | −0.2917 | 0.9783 | 1.0000 | 0.9891 | 0.1003 |
Nestorowa | 0.5335 | −0.0128 | 0.6124 | 0.6035 | 0.6294 | 0.6444 |
Treutlein | −0.0020 | −0.1692 | 0.4146 | 0.8372 | 0.5156 | 0.2137 |
Camp15 | 0.4989 | −0.4715 | 0.6333 | 0.6594 | 0.5855 | 0.6197 |
Camp17 | 0.6779 | −0.6129 | 0.6976 | 0.7836 | 0.7586 | 0.4854 |
Yan | 0.6861 | 0.3208 | 0.7748 | 0.9620 | 0.9001 | 0.3921 |
Wang | −0.3533 | −0.5255 | 0.4835 | 0.2183 | 0.4099 | 0.5145 |
Li1 | 0.2909 | −0.7039 | 0.8271 | 0.8050 | 0.7802 | 0.5820 |
Li2 | −0.2315 | 0.4114 | 0.4866 | 0.4227 | 0.3781 | 0.4231 |
Patel | −0.3625 | −0.5657 | 0.7379 | 0.8244 | 0.8327 | 0.8982 |
Pollen | 0.2559 | −0.5278 | 0.9414 | 0.9271 | 0.9369 | 0.9210 |
Manno_m | 0.4832 | −0.8098 | 0.8236 | 0.7025 | 0.8149 | 0.8482 |
Manno_h | 0.6085 | 0.9307 | 0.8803 | 0.8370 | 0.9099 | 0.9268 |
Tasic | 0.6178 | 0.8780 | 0.8747 | 0.9231 | 0.8948 | 0.8939 |
Zeisel | 0.5849 | 0.7180 | 0.7558 | 0.6948 | 0.6250 | 0.7646 |
Grun | 0.2741 | 0.0728 | 0.4554 | 0.6157 | 0.2331 | −0.0003 |
Baron | 0.1026 | 0.5293 | 0.5466 | 0.5121 | 0.4274 | 0.5723 |
Muraro | 0.2705 | 0.7018 | 0.7430 | 0.7552 | 0.4868 | 0.4388 |
Xin | 0.1638 | 0.3518 | 0.2060 | 0.8166 | 0.4312 | 0.1518 |
From Table 3, we can clearly see that, for 22 of 33 datasets, the IGRDCGA gains the largest NMI among the six algorithms, which accounts for 66.67% (22 of 33 datasets). Meanwhile, for 6 of the remaining datasets, the differences between the largest NMI and the NMI obtained by the IGRDCGA are very small. For k-means, spectral clustering, SIMLR, PCA, and tSNE, the numbers of datasets on which they obtain the largest NMI are, respectively, 0, 1, 6, 2, and 6. By comparison, the IGRDCGA outperforms the other five competing algorithms in terms of NMI.
Table 4 shows that, for 24 of 33 datasets, the IGRDCGA acquires the maximal ARI among the six algorithms, which accounts for 72.72%. For k-means, spectral clustering, SIMLR, PCA, and tSNE, the numbers of datasets on which they obtain the maximal ARI are, respectively, 0, 0, 3, 1, and 7. By contrast, the IGRDCGA is superior to the other five competing algorithms in terms of ARI.
From Table 5, it can be clearly seen that, for 21 of 33 datasets, the purity obtained by the IGRDCGA is the largest among the six algorithms, which accounts for 63.64%. For k-means, spectral clustering, SIMLR, PCA, and tSNE, the numbers of datasets on which they obtain the largest purity are, respectively, 0, 1, 6, 1, and 6. By comparison, the IGRDCGA outperforms the other five competing algorithms in terms of purity.
By comparing Tables 3–5, we summarize the best evaluation metrics obtained by six algorithms in Table 6, where the NMI, ARI, and purity metrics are, respectively, denoted by 1, 2, and 3.
Table 6
The best evaluation metrics obtained by six algorithms.
Dataset | K means | spectral | SIMLR | IGRDCGA | PCA | tSNE |
Allodiploid | 1, 2, 3 | 1, 2, 3 | 1, 2, 3 | |||
Sasagawa | 1, 2, 3 | |||||
Ramskold | 1, 2, 3 | |||||
Biase | 1, 2, 3 | |||||
Ting | 3 | 1, 2 | ||||
Chung | 3 | 1, 2 | ||||
Yeo | 1, 2, 3 | |||||
Ning | 1, 2, 3 | |||||
Ginhoux | 1, 2, 3 | |||||
Su | 1, 2, 3 | |||||
Usoskin | 1, 2, 3 | |||||
Deng | 1, 2, 3 | |||||
Fan | 1, 2, 3 | |||||
Goolam | 1, 2, 3 | |||||
Kolodz | 1, 2, 3 | |||||
Nestorowa | 1, 2, 3 | |||||
Treutlein | 1, 2, 3 | |||||
Camp15 | 1 | 2, 3 | ||||
Camp17 | 1 | 2, 3 | ||||
Yan | 1, 2, 3 | |||||
Wang | 1, 2, 3 | |||||
Li1 | 2, 3 | 1 | ||||
Li2 | 3 | 1, 2 | ||||
Patel | 1, 2, 3 | |||||
Pollen | 1, 2, 3 | |||||
Manno_m | 1, 2, 3 | |||||
Manno_h | 1, 3 | 2 | ||||
Tasic | 1, 2, 3 | |||||
Zeisel | 1, 2, 3 | |||||
Grun | 1 | 2, 3 | ||||
Baron | 1, 2, 3 | |||||
Muraro | 1 | 2, 3 | ||||
Xin | 1, 2, 3 |
From Table 6, we can clearly see that, for 15 of 33 datasets, the NMI, ARI, and purity metrics obtained by the IGRDCGA are all the best among the six algorithms, which accounts for 45.45%. For k-means, spectral clustering, SIMLR, PCA, and tSNE, the numbers of datasets on which they obtain the best NMI, ARI, and purity metrics are, respectively, 0, 0, 2, 1, and 6. Thus, the IGRDCGA outperforms the other five competing algorithms in terms of NMI, ARI, and purity.
In the meantime, Table 6 shows that the IGRDCGA gains the best NMI, ARI, and purity metrics only on part of the datasets. For the Allodiploid, the IGRDCGA, SIMLR, and PCA all obtain the best NMI, ARI, and purity metrics. We can observe from Table 1 that the Allodiploid possesses far fewer cells and genes than the other datasets; that is, for datasets with smaller dimensions, it is easier for an algorithm to obtain the best NMI, ARI, and purity metrics. In this case, the PCA is the most suitable method, as it is the simplest and easiest to implement of the three algorithms. For the Ting, Chung, and Li2, the IGRDCGA obtains the best NMI and ARI, while the SIMLR obtains the best purity. For the Camp15, Camp17, Grun, and Muraro, the IGRDCGA obtains the best ARI and purity, while the SIMLR obtains the best NMI. By comparing the IGRDCGA with the SIMLR, we can clearly see that the number of best metrics obtained by the IGRDCGA is far larger than that obtained by the SIMLR. This fully demonstrates that the IGRDCGA is superior to the SIMLR in terms of NMI, ARI, and purity.
For the Nestorowa, Wang, Patel, Manno_m, Zeisel, and Baron datasets, the tSNE gains the best NMI, ARI, and purity metrics. However, for the other datasets, the three evaluation metrics obtained by the tSNE are far worse than those obtained by the IGRDCGA. As shown in Table 3, there are 17 datasets for which the NMI of the tSNE is less than 0.1, while there are no datasets for which the NMI of the IGRDCGA is less than 0.1. In Table 4, there are 20 datasets for which the ARI of the tSNE is less than 0.1, while there are none for the IGRDCGA. In Table 5, there are 19 datasets for which the purity of the tSNE is less than 0.5, while there are only 4 such datasets for the IGRDCGA. Namely, for most datasets except the above six, the NMI, ARI, and purity metrics obtained by the IGRDCGA are superior to those obtained by the tSNE. This attests that the IGRDCGA outperforms the tSNE for most datasets in terms of NMI, ARI, and purity.
A closer analysis of these six datasets against the remaining ones reveals the following principal features and differences. First, all six datasets have very low gene redundancy rates. Table 2 shows that the redundancy rates of the Zeisel and Baron are 9.8 and 0.11, respectively, while those of the other four datasets are all 0. On datasets with low redundancy rates, the IGRDCGA may discard informative genes, which degrades classification performance. Second, the six datasets yield relatively low NMI and ARI for all six algorithms. From Table 3, all NMI metrics of the Nestorowa, Wang, and Manno_m are below 0.38, 0.33, and 0.48, respectively. From Table 4, all ARI metrics of the Nestorowa, Wang, Manno_m, and Baron are below 0.26, 0.20, 0.19, and 0.33, respectively. Third, the six datasets are almost completely sparse. As shown in our appendix file, data values within [0, 0.1] are more frequent than those in any other interval for all datasets. The situation is far more pronounced for these six datasets: if the values within [0, 0.1] are set to 0, they become almost completely sparse. For the Nestorowa, Wang, Patel, Manno_m, Zeisel, and Baron, the values within [0, 0.1] account for more than 88%, 95%, 73%, 91%, 80%, and 91%, respectively, proportions so high that they nearly swamp the values in all other ranges.
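The near-zero fraction discussed above can be measured with a few lines of NumPy. This is an illustrative sketch on a synthetic matrix, not the paper's code; the matrix shape and the 90% zeroing rate are arbitrary assumptions chosen to mimic a sparse expression matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 500))            # hypothetical cells-by-genes matrix
X[rng.random(X.shape) < 0.9] = 0.0    # force roughly 90% of entries to zero

# Fraction of expression values falling in [0, 0.1], as tabulated
# in the appendix for each dataset.
near_zero = np.mean((X >= 0.0) & (X <= 0.1))
print(f"fraction of values in [0, 0.1]: {near_zero:.2f}")
```

A fraction close to 0.9 or above, as for the six datasets named here, indicates that near-zero values dominate every other interval of the data.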
6.2.3. Iteration Plots and Heat Maps
To illustrate the performance of the IGRDCGA clearly, three scRNA-seq datasets, Allodiploid, Yeo, and Grun, are selected as representative datasets according to their different NMI values in Table 3, and their iteration plots are shown in Figures 2–4, respectively. For the Allodiploid, the NMI obtained by the IGRDCGA is 1, which means the predicted labels are fully consistent with the ground truth class labels. As Figure 2 clearly shows, the iteration plot of the Allodiploid is parallel to the X-axis at the maximal NMI of 1; that is, the IGRDCGA reaches the maximal NMI in the first iteration, and the later iterations bring no further improvement. Among the 33 scRNA-seq datasets above, only the Allodiploid exhibits the plot shown in Figure 2.
[figure omitted; refer to PDF]
In Figure 3, the iteration plot of the Yeo rises as the number of iterations increases. The ascending trend is pronounced in the early iterations and then levels off; that is, the NMI plateaus at a fixed value in the later iterations. Among the 33 scRNA-seq datasets above, most display a plot similar to that in Figure 3.
In Figure 4, the iteration plot of the Grun keeps rising as the number of iterations increases, and the ascending trend remains pronounced throughout the whole run. Among the 33 scRNA-seq datasets above, a few display a plot similar to that in Figure 4.
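An iteration plot like those in Figures 2–4 is produced by recording the best fitness (here, the NMI of the best individual) in each generation. The sketch below stubs the DCGA with a hypothetical `evolve_one_generation` function, since only the recording pattern is being illustrated, not the paper's actual genetic operators.

```python
import random

random.seed(42)

def evolve_one_generation(best_so_far):
    # Stand-in for one DCGA generation. With elitism, the best individual
    # is never lost, so the best fitness can only stay flat or improve,
    # and NMI is capped at 1.0.
    return max(best_so_far, min(1.0, best_so_far + random.uniform(0.0, 0.05)))

history = []
best_nmi = 0.3                      # hypothetical initial best NMI
for generation in range(100):
    best_nmi = evolve_one_generation(best_nmi)
    history.append(best_nmi)

# `history` is monotonically non-decreasing: plotting it against the
# generation index gives an ascending curve as in Figures 3 and 4,
# while a flat line at 1.0 corresponds to the Figure 2 case, where the
# optimum is found in the first iteration.
```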
To illustrate the classification performance of the IGRDCGA more intuitively, the heat maps of the Allodiploid, Sasagawa, and Yan are shown in Figures 5–7, respectively.
[figure omitted; refer to PDF]
The Allodiploid contains 16 cells, 2406 genes, and 2 classes, whose ground truth class labels are shown in the first column of Figure 5. From Figure 5, the results obtained by the IGRDCGA, SIMLR, and PCA are fully concordant with the ground truth class labels; those obtained by k-means and spectral clustering collapse nearly into one class; and those obtained by the tSNE are oscillatory. Clearly, the results of k-means, spectral clustering, and the tSNE are far worse than those of the IGRDCGA, SIMLR, and PCA, and the results of the tSNE are the worst among the six algorithms.
The Sasagawa contains 23 cells, 32700 genes, and 3 classes, whose ground truth labels are shown in the first column of Figure 6. From Figure 6, the results obtained by the IGRDCGA are fully concordant with the ground truth class labels; those obtained by spectral clustering collapse nearly into one class; and those obtained by the tSNE are oscillatory and the worst. In the results obtained by k-means, the SIMLR, and PCA, the cells of subclass 2 (ground truth label 2) are grouped in full concordance with the ground truth, while the other subclasses are confused. Therefore, for the Sasagawa, the IGRDCGA yields the best results and the tSNE the worst.
The first column of Figure 7 shows the ground truth labels of the Yan, which contains 90 cells, 20214 genes, and 6 classes. From Figure 7, none of the six algorithms produces results fully concordant with the ground truth class labels. However, the results of the IGRDCGA are closest to the ground truth, and its subclasses 2, 3, 4, 5, and 6 are fully concordant with the ground truth labels. The results of the tSNE are oscillatory and the worst among the six algorithms, and those of the remaining algorithms are all confused. Therefore, for the Yan, the IGRDCGA obtains the best results whereas the tSNE obtains the worst.
7. Conclusion and Future Work
In this study, we present a novel algorithm, abbreviated as IGRDCGA, to address gene selection and classification for scRNA-seq data. It combines the information gain ratio (IGR) and the genetic algorithm with dynamic crossover (DCGA): the IGR eliminates irrelevant genes roughly, and the DCGA then chooses high quality genes finely. We evaluated the IGRDCGA and several competing algorithms on 33 publicly available scRNA-seq datasets. The results demonstrate that the IGRDCGA can eliminate irrelevant genes and choose high quality genes effectively, and that it is superior to the competing algorithms in terms of classification accuracy.
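The rough-filtering stage recapped above can be sketched as follows. This is an illustrative implementation of an information-gain-ratio prefilter under simplifying assumptions (binary discretization at each gene's median, a synthetic two-class dataset); it is not the authors' code, and the DCGA refinement stage is omitted.

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a discrete label vector, in bits.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(feature, labels):
    # Information gain of a median-thresholded gene, divided by the
    # split (intrinsic) entropy to penalize uninformative partitions.
    bins = (feature > np.median(feature)).astype(int)
    h_y = entropy(labels)
    h_y_given = 0.0
    for b in np.unique(bins):
        mask = bins == b
        h_y_given += mask.mean() * entropy(labels[mask])
    split_info = entropy(bins)
    return (h_y - h_y_given) / split_info if split_info > 0 else 0.0

rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 50)             # hypothetical cell classes
X = rng.random((100, 200))                 # hypothetical expression matrix
X[:50, 0] += 1.0                           # gene 0 cleanly separates classes

scores = np.array([gain_ratio(X[:, g], labels) for g in range(X.shape[1])])
top_genes = np.argsort(scores)[::-1][:20]  # preliminary gene subset
```

The class-separating gene receives a gain ratio of 1 and lands in the preliminary subset, while the random genes score near 0; in the full method, this subset would then be refined by the DCGA.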
The algorithm leaves room for further enhancement and improvement. One direction is to adopt a more efficient encoding to improve its convergence rate and stability. Another is to extend the IGRDCGA to classification of other high-dimensional problems.
Acknowledgments
This research was supported by Science Research Foundation for High-level Talents of Yulin Normal University (no. G2021ZK17).
[1] Y. Wang, H. Liu, F. Wei, T. Zong, X. Li, "Cooperative coevolution with formula-based variable grouping for large-scale global optimization," Evolutionary Computation, vol. 26 no. 4, pp. 569-596, DOI: 10.1162/evco_a_00214, 2018.
[2] B. Nakisa, M. N. Rastgoo, D. Tjondronegoro, V. Chandran, "Evolutionary computation algorithms for feature selection of EEG-based emotion recognition using mobile sensors," Expert Systems with Applications, vol. 93, pp. 143-155, DOI: 10.1016/j.eswa.2017.09.062, 2018.
[3] D. Y. Eroglu, K. Kilic, "A novel hybrid genetic local search algorithm for feature selection and weighting with an application in strategic decision making in innovation management," Information Sciences, vol. 405, pp. 18-32, DOI: 10.1016/j.ins.2017.04.009, 2017.
[4] N. Maleki, Y. Zeinali, "A k-NN method for lung cancer prognosis with the use of a genetic algorithm for feature selection," Expert Systems with Applications, vol. 164, 2020.
[5] M. Tahir, A. Tubaishat, F. Al-Obeidat, "A novel binary chaotic genetic algorithm for feature selection and its utility in affective computing and healthcare," Neural Computing and Applications,DOI: 10.1007/s00521-020-05347-y, 2020.
[6] A. H. Mohammad, "Comparing two feature selections methods (information gain and gain ratio) on three different classification algorithms using Arabic dataset," Journal of Theoretical & Applied Information Technology, vol. 96 no. 6, pp. 1561-1569, 2018.
[7] H. Uğuz, "A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm," Knowledge-Based Systems, vol. 24 no. 7, pp. 1024-1032, 2011.
[8] A. Chinnaswamy, R. Srinivasan, "Hybrid information gain based fuzzy roughset feature selection in cancer microarray data," Proceedings of the International Conference on Innovations in Power and Advanced Computing Technologies (I-PACT), .
[9] C. Dai, Y. Wang, M. Ye, X. Xue, H. Liu, "An orthogonal evolutionary algorithm with learning automata for multiobjective optimization," IEEE Transactions on Cybernetics, vol. 46 no. 12, pp. 3306-3319, DOI: 10.1109/TCYB.2015.2503433, 2016.
[10] X. Xue, Y. Wang, "Using memetic algorithm for instance coreference resolution," IEEE Transactions on Knowledge and Data Engineering, vol. 28 no. 2, pp. 580-591, DOI: 10.1109/tkde.2015.2475755, 2016.
[11] J. Liu, Y. Wang, N. Fan, S. Wei, W. Tong, "A convergence-diversity balanced fitness evaluation mechanism for decomposition-based many-objective optimization algorithm," Integrated Computer-Aided Engineering, vol. 26 no. 2, pp. 159-184, DOI: 10.3233/ica-180594, 2019.
[12] H. Liu, Y. Wang, L. Liu, X. Li, "A two phase hybrid algorithm with a new decomposition method for large scale optimization," Integrated Computer-Aided Engineering, vol. 25 no. 4, pp. 349-367, DOI: 10.3233/ica-170571, 2018.
[13] M. Ye, "A hybrid genetic algorithm for the minimum exposure path problem of wireless sensor networks based on a numerical functional extreme model," IEEE Transactions on Vehicular Technology, vol. 65 no. 10, pp. 8644-8657, 2015.
[14] X. Xue, Y. Wang, "Optimizing ontology alignments through a memetic algorithm using both MatchFmeasure and unanimous improvement ratio," Artificial Intelligence, vol. 223, pp. 65-81, DOI: 10.1016/j.artint.2015.03.001, 2015.
[15] C. Dai, Y. Wang, "A new decomposition based evolutionary algorithm with uniform designs for many-objective optimization," Applied Soft Computing, vol. 30 no. 1, pp. 238-248, 2015.
[16] Y.-M. Cheung, F. Gu, H.-L. Liu, "Objective extraction for many-objective optimization problems: algorithm and test problems," IEEE Transactions on Evolutionary Computation, vol. 20 no. 5, pp. 755-772, DOI: 10.1109/tevc.2016.2519758, 2016.
[17] P. A. Estévez, M. Tesmer, C. A. Perez, J. M. Zurada, "Normalized mutual information feature selection," IEEE Transactions on Neural Networks, vol. 20 no. 2, pp. 189-201, DOI: 10.1109/TNN.2008.2005601, 2009.
[18] A. F. McDaid, D. Greene, N. Hurley, "Normalized mutual information to evaluate overlapping community finding algorithms," 2011. https://arxiv.org/abs/1110.2515
[19] O. Abedinia, N. Amjady, H. Zareipour, "A new feature selection technique for load and price forecast of electrical power systems," IEEE Transactions on Power Systems, vol. 32 no. 1, pp. 62-74, DOI: 10.1109/tpwrs.2016.2556620, 2017.
[20] C. Xu, Z. Su, "Identification of cell types from single-cell transcriptomes using a novel clustering method," Bioinformatics, vol. 31 no. 12, pp. 1974-1980, DOI: 10.1093/bioinformatics/btv088, 2015.
[21] V. Y. Kiselev, K. Kirschner, M. T. Schaub, T. Andrews, A. Yiu, T. Chandra, K. N. Natarajan, W. Reik, M. Barahona, A. R. Green, M. Hemberg, "SC3: consensus clustering of single-cell RNA-seq data," Nature Methods, vol. 14 no. 5, pp. 483-486, DOI: 10.1038/nmeth.4236, 2017.
[22] S. Wagner, D. Wagner, "Comparing clusterings: an overview," 2007.
[23] X. Li, X. L. Cui, J. Q. Wang, Y. K. Wang, Y. F. Li, L. Y. Wang, "Generation and application of mouse-rat allodiploid embryonic stem cells," Cell, vol. 164 no. 1-2, pp. 279-292, DOI: 10.1016/j.cell.2015.11.035, 2016.
[24] Y. Sasagawa, I. Nikaido, T. Hayashi, H. Danno, K. D. Uno, T. Imai, H. R. Ueda, "Quartz-Seq: a highly reproducible and sensitive single-cell RNA sequencing method, reveals non-genetic gene-expression heterogeneity," Genome Biology, vol. 14 no. 4,DOI: 10.1186/gb-2013-14-4-r31, 2013.
[25] D. Ramsköld, S. Luo, Y. C. Wang, R. Li, Q. Deng, O. R. Faridani, G. A. Daniels, I. Khrebtukova, J. F. Loring, L. C. Laurent, G. P. Schroth, R. Sandberg, "Full-length mRNA-Seq from single-cell levels of RNA and individual circulating tumor cells," Nature Biotechnology, vol. 30 no. 8, pp. 777-782, DOI: 10.1038/nbt.2282, 2012.
[26] F. H. Biase, X. Cao, S. Zhong, "Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell RNA sequencing," Genome Research, vol. 24 no. 11, pp. 1787-1796, DOI: 10.1101/gr.177725.114, 2014.
[27] D. T. Ting, B. S. Wittner, M. Ligorio, N. Vincent Jordan, A. M. Shah, D. T. Miyamoto, N. Aceto, F. Bersani, B. W. Brannigan, K. Xega, J. C. Ciciliano, H. Zhu, O. C. MacKenzie, J. Trautwein, K. S. Arora, M. Shahid, H. L. Ellis, N. Qu, N. Bardeesy, M. N. Rivera, V. Deshpande, C. R. Ferrone, R. Kapur, S. Ramaswamy, T. Shioda, M. Toner, S. Maheswaran, D. A. Haber, "Single-cell RNA sequencing identifies extracellular matrix gene expression by pancreatic circulating tumor cells," Cell Reports, vol. 8 no. 6, pp. 1905-1918, DOI: 10.1016/j.celrep.2014.08.029, 2014.
[28] W. Chung, H. H. Eum, H.-O. Lee, K.-M. Lee, H.-B. Lee, K.-T. Kim, H. S. Ryu, S. Kim, J. E. Lee, Y. H. Park, Z. Kan, W. Han, W.-Y. Park, "Single-cell RNA-seq enables comprehensive tumour and immune cell profiling in primary breast cancer," Nature Communications, vol. 8 no. 1,DOI: 10.1038/ncomms15081, 2017.
[29] T. Yeo, S. J. Tan, C. L. Lim, D. P. X. Lau, Y. W. Chua, S. S. Krisna, G. Iyer, G. S. Tan, T. K. H. Lim, D. S. W. Tan, W.-T. Lim, C. T. Lim, "Microfluidic enrichment for the single cell analysis of circulating tumor cells," Scientific Reports, vol. 6 no. 1, pp. 22076-22087, DOI: 10.1038/srep22076, 2016.
[30] N. Leng, L. F. Chu, C. Barry, Y. Li, J. Choi, X. Li, P. Jiang, R. M. Stewart, J. A. Thomson, C. Kendziorski, "Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments," Nature Methods, vol. 12 no. 10, pp. 947-950, DOI: 10.1038/nmeth.3549, 2015.
[31] A. Schlitzer, V. Sivakamasundari, J. Chen, H. R. B. Sumatoh, J. Schreuder, J. Lum, B. Malleret, S. Zhang, A. Larbi, F. Zolezzi, L. Renia, M. Poidinger, S. Naik, E. W. Newell, P. Robson, F. Ginhoux, "Identification of cDC1- and cDC2-committed DC progenitors reveals early lineage priming at the common DC progenitor stage in the bone marrow," Nature Immunology, vol. 16 no. 7, pp. 718-728, DOI: 10.1038/ni.3200, 2015.
[32] X. Su, Y. Shi, X. Zou, Z.-N. Lu, G. Xie, J. Y. H. Yang, C.-C. Wu, X.-F. Cui, K.-Y. He, Q. Luo, Y.-L. Qu, N. Wang, L. Wang, Z.-G. Han, "Single-cell RNA-Seq analysis reveals dynamic trajectories during mouse liver development," BMC Genomics, vol. 18 no. 1,DOI: 10.1186/s12864-017-4342-x, 2017.
[33] D. Usoskin, A. Furlan, S. Islam, H. Abdo, P. Lönnerberg, D. Lou, J. Hjerling-Leffler, J. Haeggström, O. Kharchenko, P. V. Kharchenko, S. Linnarsson, P. Ernfors, "Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing," Nature Neuroscience, vol. 18 no. 1, pp. 145-153, DOI: 10.1038/nn.3881, 2015.
[34] Q. Deng, D. Ramsköld, B. Reinius, R. Sandberg, "Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells," Science, vol. 343 no. 6167, pp. 193-196, DOI: 10.1126/science.1245316, 2014.
[35] X. Fan, X. Zhang, X. Wu, H. Guo, Y. Hu, F. Tang, Y. Huang, "Single-cell RNA-seq transcriptome analysis of linear and circular RNAs in mouse preimplantation embryos," Genome Biology, vol. 16 no. 1, pp. 148-217, DOI: 10.1186/s13059-015-0706-1, 2015.
[36] M. Goolam, A. Scialdone, S. J. L. Graham, I. C. Macaulay, A. Jedrusik, A. Hupalowska, T. Voet, J. C. Marioni, M. Zernicka-Goetz, "Heterogeneity in Oct4 and Sox2 targets biases cell fate in 4-cell mouse embryos," Cell, vol. 165 no. 1, pp. 61-74, DOI: 10.1016/j.cell.2016.01.047, 2016.
[37] A. A. Kolodziejczyk, J. K. Kim, J. C. H. Tsang, T. Ilicic, J. Henriksson, K. N. Natarajan, A. C. Tuck, X. Gao, M. Bühler, P. Liu, J. C. Marioni, S. A. Teichmann, "Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation," Cell Stem Cell, vol. 17 no. 4, pp. 471-485, DOI: 10.1016/j.stem.2015.09.011, 2015.
[38] S. Nestorowa, F. K. Hamey, B. Pijuan Sala, E. Diamanti, M. Shepherd, E. Laurenti, N. K. Wilson, D. G. Kent, B. Göttgens, "A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation," Blood, vol. 128 no. 8, pp. e20-e31, DOI: 10.1182/blood-2016-05-716480, 2016.
[39] B. Treutlein, D. G. Brownfield, A. R. Wu, N. F. Neff, G. L. Mantalas, F. H. Espinoza, T. J. Desai, M. A. Krasnow, S. R. Quake, "Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq," Nature, vol. 509 no. 7500, pp. 371-375, DOI: 10.1038/nature13173, 2014.
[40] J. G. Camp, F. Badsha, M. Florio, S. Kanton, T. Gerber, M. Wilsch-Bräuninger, E. Lewitus, A. Sykes, W. Hevers, M. Lancaster, J. A. Knoblich, R. Lachmann, S. Pääbo, W. B. Huttner, B. Treutlein, "Human cerebral organoids recapitulate gene expression programs of fetal neocortex development," Proceedings of the National Academy of Sciences, vol. 112 no. 51, pp. 15672-15677, DOI: 10.1073/pnas.1520760112, 2015.
[41] J. G. Camp, K. Sekine, T. Gerber, H. Loeffler-Wirth, H. Binder, M. Gac, S. Kanton, J. Kageyama, G. Damm, D. Seehofer, L. Belicova, M. Bickle, R. Barsacchi, R. Okuda, E. Yoshizawa, M. Kimura, H. Ayabe, H. Taniguchi, T. Takebe, B. Treutlein, "Multilineage communication regulates human liver bud development from pluripotency," Nature, vol. 546 no. 7659, pp. 533-538, DOI: 10.1038/nature22796, 2017.
[42] L. Yan, M. Yang, H. Guo, L. Yang, J. Wu, R. Li, P. Liu, Y. Lian, X. Zheng, J. Yan, J. Huang, M. Li, X. Wu, L. Wen, K. Lao, R. Li, J. Qiao, F. Tang, "Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells," Nature Structural & Molecular Biology, vol. 20 no. 9, pp. 1131-1139, DOI: 10.1038/nsmb.2660, 2013.
[43] Y. J. Wang, J. Schug, K.-J. Won, C. Liu, A. Naji, D. Avrahami, M. L. Golson, K. H. Kaestner, "Single-cell transcriptomics of the human endocrine pancreas," Diabetes, vol. 65 no. 10, pp. 3028-3038, DOI: 10.2337/db16-0405, 2016.
[44] H. Li, E. T. Courtois, D. Sengupta, Y. Tan, K. H. Chen, J. J. L. Goh, S. L. Kong, C. Chua, L. K. Hon, W. S. Tan, M. Wong, P. J. Choi, L. J. K. Wee, A. M. Hillmer, I. B. Tan, P. Robson, S. Prabhakar, "Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors," Nature Genetics, vol. 49 no. 5, pp. 708-718, DOI: 10.1038/ng.3818, 2017.
[45] J. Li, "Single-cell transcriptomes reveal characteristic features of human pancreatic islet cell types," EMBO Reports, vol. 17 no. 2, pp. 178-187, 2016.
[46] A. P. Patel, I. Tirosh, J. J. Trombetta, A. K. Shalek, S. M. Gillespie, H. Wakimoto, D. P. Cahill, B. V. Nahed, W. T. Curry, R. L. Martuza, D. N. Louis, O. Rozenblatt-Rosen, M. L. Suvà, A. Regev, B. E. Bernstein, "Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma," Science, vol. 344 no. 6190, pp. 1396-1401, DOI: 10.1126/science.1254257, 2014.
[47] A. A. Pollen, "Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex," Nature Biotechnology, vol. 32 no. 10, pp. 1053-1058, 2014.
[48] G. La Manno, D. Gyllborg, S. Codeluppi, K. Nishimura, C. Salto, A. Zeisel, L. E. Borm, S. R. W. Stott, E. M. Toledo, J. C. Villaescusa, P. Lönnerberg, J. Ryge, R. A. Barker, E. Arenas, S. Linnarsson, "Molecular diversity of midbrain development in mouse, human, and stem cells," Cell, vol. 167 no. 2, pp. 566-580, DOI: 10.1016/j.cell.2016.09.027, 2016.
[49] B. Tasic, V. Menon, T. N. Nguyen, T. K. Kim, T. Jarsky, Z. Yao, B. Levi, L. T. Gray, S. A. Sorensen, T. Dolbeare, D. Bertagnolli, J. Goldy, N. Shapovalova, S. Parry, C. Lee, K. Smith, A. Bernard, L. Madisen, S. M. Sunkin, M. Hawrylycz, C. Koch, H. Zeng, "Adult mouse cortical cell taxonomy revealed by single cell transcriptomics," Nature Neuroscience, vol. 19 no. 2, pp. 335-346, DOI: 10.1038/nn.4216, 2016.
[50] A. Zeisel, "Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq," Science, vol. 347 no. 6226, pp. 1138-1142, 2015.
[51] D. Grün, M. J. Muraro, J. C. Boisset, K. Wiebrands, A. Lyubimova, G. Dharmadhikari, M. van den Born, J. van Es, E. Jansen, H. Clevers, E. J. P. de Koning, A. van Oudenaarden, "De novo prediction of stem cell identity using Single-Cell transcriptome data," Cell Stem Cell, vol. 19 no. 2, pp. 266-277, DOI: 10.1016/j.stem.2016.05.010, 2016.
[52] M. Baron, A. Veres, S. L. Wolock, A. L. Faust, R. Gaujoux, A. Vetere, J. H. Ryu, B. K. Wagner, S. S. Shen-Orr, A. M. Klein, D. A. Melton, I. Yanai, "A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure," Cell Systems, vol. 3 no. 4, pp. 346-360, DOI: 10.1016/j.cels.2016.08.011, 2016.
[53] M. J. Muraro, G. Dharmadhikari, D. Grün, N. Groen, T. Dielen, E. Jansen, L. van Gurp, M. A. Engelse, F. Carlotti, E. J. P. de Koning, A. van Oudenaarden, "A single-cell transcriptome atlas of the human pancreas," Cell Systems, vol. 3 no. 4, pp. 385-394, DOI: 10.1016/j.cels.2016.09.002, 2016.
[54] Y. Xin, J. Kim, H. Okamoto, M. Ni, Y. Wei, C. Adler, A. J. Murphy, G. D. Yancopoulos, C. Lin, J. Gromada, "RNA sequencing of single human islet cells reveals type 2 diabetes genes," Cell Metabolism, vol. 24 no. 4, pp. 608-615, DOI: 10.1016/j.cmet.2016.08.018, 2016.
[55] A. Lipowski, D. Lipowska, "Roulette-wheel selection via stochastic acceptance," Physica A: Statistical Mechanics and Its Applications, vol. 391 no. 6, pp. 2193-2196, 2012.
[56] V. Ho-Huu, T. Nguyen-Thoi, T. Truong-Khac, L. Le-Anh, T. Vo-Duy, "An improved differential evolution based on roulette wheel selection for shape and size optimization of truss structures with frequency constraints," Neural Computing and Applications, vol. 29 no. 1, pp. 167-185, DOI: 10.1007/s00521-016-2426-1, 2018.
[57] W. Chen, G. Feng, "Spectral clustering: a semi-supervised approach," Neurocomputing, vol. 77 no. 1, pp. 229-242, DOI: 10.1016/j.neucom.2011.09.002, 2012.
[58] B. Wang, J. Zhu, E. Pierson, D. Ramazzotti, S. Batzoglou, "Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning," Nature Methods, vol. 14 no. 4, pp. 414-416, DOI: 10.1038/nmeth.4207, 2017.
Copyright © 2022 Junhong Feng et al. This work is licensed under http://creativecommons.org/licenses/by/4.0/.
Abstract
Single-cell RNA sequencing (scRNA-seq) is emerging as a promising technology. A scRNA-seq dataset contains a huge number of genes. Some of these are high quality genes, while others are noisy or irrelevant owing to technical artifacts. Such noisy and irrelevant genes can strongly affect downstream analyses, such as cell classification, gene function analysis, and cancer biomarker detection. It is therefore important to remove irrelevant genes and choose high quality genes with gene selection methods. In this study, a novel gene selection and classification method is presented that combines the information gain ratio and the genetic algorithm with dynamic crossover (abbreviated as IGRDCGA). The information gain ratio (IGR) is employed to eliminate irrelevant genes roughly and obtain a preliminary gene subset, and the genetic algorithm with dynamic crossover (DCGA) is then utilized to choose high quality genes finely from that subset. The main difference between the IGRDCGA and existing methods is that it is the first to integrate the DCGA and the IGR for selecting genes from scRNA-seq data. We evaluated the IGRDCGA and several competing methods on real-world scRNA-seq datasets. The results demonstrate that the IGRDCGA chooses high quality genes effectively and efficiently and outperforms the competing methods in terms of both dimensionality reduction and classification accuracy.