1. Introduction
In scRNA-seq data, the number of genes is often very large and may reach tens of thousands. Some genes are irrelevant or unsuitable for classification tasks, and they may seriously degrade the efficiency of downstream data analysis. If all genes are utilized in classification, both the classification accuracy and the classification efficiency may be low. To eliminate these irrelevant genes and select high-quality genes, an effective and efficient gene selection algorithm is vital.
Feature selection (FS) problems can be treated as large-scale global optimization problems [1]; therefore, bioinspired intelligence optimization algorithms can be used to address them. Wang et al. [1] regarded FS problems as large-scale global optimization problems. Nakisa et al. [2] utilized evolutionary computation (EC) to search for the optimal feature subset. Eroglu and Kilic [3] integrated a genetic local search algorithm and a k-nearest neighbor classifier to select feature subsets. Maleki and Zeinali [4] used a hybrid genetic algorithm (GA) for dimension reduction and applied it to the classification of lung cancer. Tahir et al. [5] presented a binary chaotic GA to select features for healthcare datasets. However, how to correctly use the GA for gene selection and classification of scRNA-seq data is a significant open issue. To the best of our knowledge, there are only a few studies on this topic so far.
This study integrates the information gain ratio (IGR) and a genetic algorithm with dynamic crossover (DCGA) to address the gene selection and classification of scRNA-seq data and proposes a novel gene selection and classification algorithm, IGRDCGA. The IGRDCGA utilizes the IGR to eliminate irrelevant genes roughly and obtain a preliminary gene subset and then employs the DCGA to choose high-quality genes finely from the preliminary subset.
The rest of this study is organized as follows. Section 2 briefly describes the information gain ratio and the genetic algorithm. Section 3 states three evaluation metrics. The datasets and the preprocessing method used in the study are described in Section 4. The coding and the other details of the IGRDCGA are described in Section 5. The numerical results of the IGRDCGA and several competing algorithms are given in Section 6. Section 7 concludes the study.
2. Related Work
In this section, the information gain ratio and genetic algorithm are described as follows.
2.1. Information Gain Ratio
The information gain is a metric derived from information entropy and is often used to evaluate the degree of mutual dependence between two random variables. Namely, it is a symmetric metric of dependency [6].
For two discrete random variables X and Y, their information entropies can be calculated, respectively, in terms of the following formulas:

H(X) = -Σ_x p(x) log2 p(x), (1)

H(Y) = -Σ_y p(y) log2 p(y). (2)

The conditional entropy and information gain [6–8] of X versus Y can be calculated in terms of the following two formulas, respectively:

H(X|Y) = -Σ_y p(y) Σ_x p(x|y) log2 p(x|y), (3)

IG(X; Y) = H(X) - H(X|Y). (4)

The information gain ratio of X versus Y is the ratio of the information gain to the information entropy, which is formulated as

IGR(X; Y) = IG(X; Y) / H(Y). (5)
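As a concrete illustration, the entropy, conditional entropy, information gain, and information gain ratio can be computed for discrete label vectors as in the following minimal Python sketch. Normalizing the gain by the entropy of the conditioning variable is one common convention and is an assumption here, as is the use of Python rather than the authors' environment:

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Information entropy H(X) of a discrete variable, in bits."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    """Conditional entropy H(X|Y): weighted entropy of X within each group of Y."""
    n = len(ys)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def info_gain(xs, ys):
    """Information gain IG(X; Y) = H(X) - H(X|Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)

def info_gain_ratio(xs, ys):
    """Information gain ratio: the gain normalized by an entropy (assumed H(Y))."""
    h = entropy(ys)
    return info_gain(xs, ys) / h if h > 0 else 0.0
```

In a gene selection setting, `xs` would hold the class labels and `ys` a discretized expression profile of one gene; an IGR of 0 then flags the gene as irrelevant.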
2.2. Genetic Algorithm
The genetic algorithm (GA) is a bioinspired intelligence optimization algorithm. It is inspired by the process of natural selection and belongs to the family of evolutionary algorithms (EAs) [9–13]. It is commonly utilized to generate feasible solutions for optimization problems by performing operators such as selection, crossover, and mutation [14–16]. The selection operator chooses a subset of chromosomes from the previous population for the crossover operator; a frequently used selection operator is a random selection strategy such as tournament selection. The crossover operator exchanges one or more genes between the two parents chosen by the selection operator, simulating reproduction or recombination in the biological evolution process. The GA decides whether to perform the crossover operator according to the crossover probability. Through the crossover operator, two parents may generate two or more offspring, depending on the crossover strategy; frequently used crossover operators include single-point, two-point, and multipoint crossover. The mutation operator modifies one or more genes in a chromosome, simulating gene mutation in the biological evolution process. Similarly, the GA decides whether to perform the mutation operator according to the mutation probability. Frequently used mutation operators include locus mutation, exchange mutation, and insertion mutation. The mutation operator acts on a single chromosome, while the crossover operator acts on two chromosomes. Generally, the crossover probability is much larger than the mutation probability, which accords with the biological evolution process.
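As an illustration of the tournament selection strategy mentioned above, a minimal Python sketch follows; the tournament size `k` and the convention that larger fitness wins are assumptions, not details from the paper:

```python
import random

def tournament_select(population, fitness, k=2):
    """Pick the fittest of k randomly drawn chromosomes (larger fitness wins)."""
    contestants = random.sample(range(len(population)), k)
    best = max(contestants, key=lambda i: fitness[i])
    return population[best]
```

With `k` equal to the population size, the selection becomes deterministic; smaller `k` keeps more randomness and diversity.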
3. Evaluation Metrics
To evaluate the performance of the IGRDCGA, in the study, we utilize the following evaluation metrics: NMI (normalized mutual information) [17–19], ARI (adjusted Rand index) [20, 21], and purity [22].
3.1. NMI (Normalized Mutual Information)
The NMI is a frequently used evaluation metric. It is often used to evaluate the agreement between the obtained clustering results and the ground truth results.
For two discrete random variables X and Y, their MI (mutual information) can be calculated in terms of the following formula:

MI(X; Y) = Σ_x Σ_y p(x, y) log2 [p(x, y) / (p(x) p(y))]. (6)

The normalized MI is taken as the NMI, which can be calculated in terms of the following formula (normalizing by the geometric mean of the two entropies):

NMI(X; Y) = MI(X; Y) / sqrt(H(X) H(Y)). (7)
Example 1.
Suppose that the ground truth class labels of a single-cell dataset are as follows: y1 = [1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3] and the classification result of an algorithm is as follows: y2 = [1 2 1 1 1 1 1 2 2 2 2 3 1 1 3 3 3]. It follows from y1 and y2 that the unique values of both are [1 2 3]. Then, the probability values in formula (6) are calculated as follows:
According to formula (6), it follows that
According to formula (1) or (2), it follows that
According to formula (7), it follows that
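The NMI of Example 1 can be checked numerically. The following minimal Python sketch computes MI and NMI for y1 and y2; the normalization by the geometric mean of the two entropies is one common convention and is an assumption here:

```python
from collections import Counter
from math import log2, sqrt

def entropy(xs):
    """Information entropy of a discrete label vector, in bits."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """Mutual information of two label vectors via their joint distribution."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

def nmi(xs, ys):
    """NMI: MI normalized by the geometric mean of the two entropies."""
    return mutual_information(xs, ys) / sqrt(entropy(xs) * entropy(ys))

y1 = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
y2 = [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 3, 3, 3]
score = nmi(y1, y2)
```

A perfect match (or a pure relabeling) yields NMI = 1, while independent labelings yield values near 0.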
3.2. ARI (Adjusted Rand Index)
The ARI is another widely used evaluation metric for measuring the concordance between two clustering results. The RI (Rand index) measures the similarity between two results: it considers all pairs of samples and counts the pairs that are placed in the same cluster, or in different clusters, in both the predicted and the ground truth clusterings.
For two clustering results X and Y containing n elements, let a denote the number of element pairs placed in the same cluster in both X and Y, and b the number of pairs placed in different clusters in both. The RI of X and Y is calculated in terms of the following formula:

RI = (a + b) / C(n, 2). (8)
The overlap between X and Y can be summarized in a contingency table (n_ij), in which each element n_ij represents the number of objects in common between cluster i of X and cluster j of Y, with row sums a_i and column sums b_j. The ARI is the adjusted RI, which can be calculated in terms of the following formula:

ARI = [Σ_ij C(n_ij, 2) - Σ_i C(a_i, 2) Σ_j C(b_j, 2) / C(n, 2)] / [(Σ_i C(a_i, 2) + Σ_j C(b_j, 2)) / 2 - Σ_i C(a_i, 2) Σ_j C(b_j, 2) / C(n, 2)]. (9)
The ARI is the corrected-for-chance version of the RI. The RI may only yield a value between 0 and 1; the ARI can yield negative values if the index is less than the expected index. The optimal score of the ARI is 1, which represents that two clustering results are identical. A larger ARI signifies higher concordance between X and Y.
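A minimal Python sketch of the ARI computed from the contingency table of two label vectors follows; it implements the standard chance-corrected formula and is not any implementation from the cited works:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(xs, ys):
    """ARI of two label vectors via their contingency table."""
    n = len(xs)
    contingency = Counter(zip(xs, ys))          # n_ij counts
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(xs).values())  # row sums a_i
    sum_b = sum(comb(c, 2) for c in Counter(ys).values())  # column sums b_j
    expected = sum_a * sum_b / comb(n, 2)       # expected index under chance
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Note that the ARI is invariant under relabeling of clusters, and random assignments score near 0, possibly negative.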
3.3. Purity
The purity [22] is another simple and transparent evaluation metric for evaluating clustering performance. For purity, each identified cluster is assigned the label that is most frequent in the cluster, and the accuracy of this assignment is then computed by counting the number of correctly assigned cells and dividing by the total number of cells N. This mapping is not one-to-one and may be biased toward the class with the largest size. Nonetheless, it provides a simple metric.
For a clustering result X = {x_1, ..., x_K} and the ground truth partition Y = {y_1, ..., y_J} containing n elements, the purity of X with respect to Y is calculated in terms of the following formula:

purity(X, Y) = (1/n) Σ_k max_j |x_k ∩ y_j|. (10)
The purity ranges from 0 to 1, where 1 is the optimal score, denoting that X and Y induce identical cluster assignments. A larger purity signifies higher concordance between X and Y.
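A minimal Python sketch of the purity computation described above (majority label per predicted cluster, then overall accuracy):

```python
from collections import Counter

def purity(predicted, truth):
    """Assign each predicted cluster its majority true label; return accuracy."""
    n = len(predicted)
    clusters = {}
    for p, t in zip(predicted, truth):
        clusters.setdefault(p, []).append(t)
    # Count, per cluster, how many members carry the cluster's majority label.
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in clusters.values())
    return correct / n
```

Because the cluster-to-label mapping is not one-to-one, a degenerate clustering that puts every cell in its own cluster trivially attains purity 1, so purity is best read alongside NMI and ARI.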
4. Dataset and Preprocessing
4.1. Dataset
In the study, 33 publicly available scRNA-seq datasets are utilized to verify the performance of the IGRDCGA, as shown in Table 1. The datasets contain various single-cell gene expression data published in many different studies. Each row in a dataset denotes an observation or cell, while each column denotes a feature or gene. Table 1 shows the features of the datasets, such as GSE, name, the number of cells (#cell), the number of genes (#gene), the number of ground truth classes (#class), and references.
Table 1
ScRNA-seq datasets and their features.
GSE/ID | Name | # Cell | # Gene | #Class | References |
GSE74543 | Allodiploid | 16 | 2406 | 2 | Li et al., 2016 [23] |
GSE42268 | Sasagawa | 23 | 32700 | 3 | Sasagawa et al., 2013 [24] |
GSE38495 | Ramskold | 33 | 21042 | 7 | Ramskold et al., 2012 [25] |
GSE57249 | Biase | 56 | 25734 | 4 | Biase et al., 2014 [26] |
GSE51372 | Ting | 187 | 21583 | 7 | Ting et al., 2014 [27] |
GSE75688 | Chung | 518 | 41821 | 4 | Chung et al., 2017 [28] |
GSE85908 | Yeo | 214 | 27473 | 4 | Yeo et al., 2017 [29] |
GSE64016 | Ning | 247 | 19080 | 3 | Ling et al., 2015 [30] |
GSE60783 | Ginhoux | 251 | 11834 | 3 | Schlitzer et al., 2015 [31] |
GSE87795 | Su | 367 | 15333 | 6 | Su et al., 2017 [32] |
GSE59739 | Usoskin | 622 | 25334 | 4 | Usoskin et al., 2015 [33] |
GSE45719 | Deng | 268 | 22431 | 6 | Deng et al., 2014 [34] |
GSE53386 | Fan | 66 | 26357 | 6 | Fan et al., 2015 [35] |
E-MTAB-3321 | Goolam | 124 | 40315 | 5 | Goolam et al., 2016 [36] |
E-MTAB-2600 | Kolodz | 704 | 13473 | 3 | Kolodziejczyk et al., 2015 [37] |
GSE81682 | Nestorowa | 1656 | 4773 | 9 | Nestorowa et al., 2016 [38] |
GSE52583 | Treutlein | 80 | 23271 | 5 | Treutlein et al., 2014 [39] |
GSE75140 | Camp15 | 734 | 18927 | 6 | Camp et al., 2015 [40] |
GSE81252 | Camp17 | 777 | 19020 | 7 | Camp et al., 2017 [41] |
GSE36552 | Yan | 90 | 20214 | 6 | Yan et al., 2013 [42] |
GSE83139 | Wang | 635 | 19950 | 8 | Wang et al., 2016 [43] |
GSE81861 | Li1 | 561 | 55186 | 9 | Li et al., 2017 [44] |
GSE73727 | Li2 | 60 | 180253 | 6 | Li et al., 2016 [45] |
GSE57872 | Patel | 430 | 5948 | 5 | Patel et al., 2014 [46] |
SRP041736 | Pollen | 301 | 23730 | 11 | Alex et al., 2014 [47] |
GSE76381 | Manno_m | 2150 | 24378 | 32 | La Manno et al., 2016 [48] |
GSE76381 | Manno_h | 4029 | 20560 | 56 | La Manno et al., 2016 [48] |
GSE71585 | Tasic | 1679 | 24057 | 18 | Tasic et al., 2016 [49] |
GSE60361 | Zeisel | 3005 | 19486 | 9 | Zeisel et al., 2015 [50] |
GSE76983 | Grun | 1502 | 23536 | 2 | Grun et al., 2016 [51] |
GSE84133 | Baron | 1886 | 14878 | 13 | Baron et al., 2016 [52] |
GSE85241 | Muraro | 2126 | 19127 | 10 | Muraro et al., 2016 [53] |
GSE81608 | Xin | 1600 | 39851 | 8 | Xin et al., 2016 [54] |
4.2. Data Preprocessing
For a scRNA-seq dataset containing N cells and M genes, let its gene expression matrix be D = (d_ij), with N rows (cells) and M columns (genes). To lower the impact of genes with large expression levels on genes with small expression levels, all the data in D are normalized column by column with min-max normalization:

d'_ij = (d_ij - min_i d_ij) / (max_i d_ij - min_i d_ij).
It is obvious that all the elements are normalized according to each column (each column denotes a gene). Therefore, after preprocessing, the gene expression level of each gene ranges from 0 to 1, and all the elements in D range from 0 to 1 as well.
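The column-wise (per-gene) min-max normalization described above can be sketched as follows; the handling of constant columns (clamping their range to 1 so they normalize to all zeros) is an assumption, since the text does not specify it:

```python
import numpy as np

def minmax_normalize(data):
    """Column-wise (per-gene) min-max normalization to [0, 1].

    Constant columns would divide by zero, so their range is clamped
    to 1 and they normalize to all zeros (an assumed convention)."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0
    return (data - col_min) / col_range
```

After this step, every gene's expression values, and hence every element of the matrix, lie in [0, 1], as stated above.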
5. The Proposed Algorithm
In this study, we present a novel algorithm to address the gene selection and classification for scRNA-seq data by combining information gain ratio and genetic algorithm with dynamic crossover (IGRDCGA for short). The coding and the other details of the IGRDCGA are as follows.
5.1. Coding and Initialization
To choose high-quality genes from a huge number of genes, we need to design a good coding for each chromosome. For a scRNA-seq dataset containing N cells and M genes, we design a variable-length coding. Numbering the M genes from 1 to M, the coding of a chromosome can be expressed as a sequence C = (g_1, g_2, ..., g_l), where each g_k is a gene number in [1, M] and the length l may differ between chromosomes.
Obviously, the coding of a chromosome signifies a combination of genes, namely, a result of gene selection. Therefore, different chromosomes can signify different gene selections, and variable-length coding can signify gene selections of variable size.
According to the above coding rules of a chromosome, Algorithm 1 presents the initialization algorithm to generate the initial population whose population size is Npop.
Algorithm 1: Initialization of population.
(1) Generate one random integer l in the interval [1, M].
(2) Generate l random integers in the interval [1, M].
(3) Repeat steps (1) and (2) Npop times to obtain the initial population.
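Algorithm 1 can be sketched in Python as follows; drawing the l gene numbers without replacement (so they are distinct within a chromosome) is an assumption, since Algorithm 1 does not state whether duplicates are allowed:

```python
import random

def init_population(n_pop, n_genes):
    """Algorithm 1 sketch: n_pop variable-length chromosomes of gene numbers in [1, M]."""
    population = []
    for _ in range(n_pop):
        # Step (1): random length l in [1, M].
        length = random.randint(1, n_genes)
        # Step (2): l gene numbers in [1, M]; sampling without replacement
        # keeps them distinct (assumed convention).
        population.append(random.sample(range(1, n_genes + 1), length))
    return population
```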
5.2. Crossover Operator
As the length of the coding of a chromosome is variable, we need to design a new crossover operator to fit variable-length coding. Given the codings of two parents C1 and C2, let their lengths be l1 and l2, respectively, and let l3 denote the minimum of l1 and l2. The number of genes exchanged by the crossover operator should be less than or equal to l3. This study presents a new crossover operator, as shown in Algorithm 2.
Algorithm 2: Crossover operator.
(1) Generate one random integer l4 that is less than or equal to l3.
(2) For C1, generate the location to exchange loc1, which contains l4 different random integers that are larger than or equal to 1 and less than or equal to l1.
(3) For C2, generate the location to exchange loc2, which contains l4 different random integers that are larger than or equal to 1 and less than or equal to l2.
(4) The coding bits at the locations loc1 in C1 and the coding bits at the locations loc2 in C2 are exchanged with each other.
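Algorithm 2 can be sketched as follows; pairing the i-th exchange location in C1 with the i-th exchange location in C2 is an assumption, as the text does not fix the pairing order:

```python
import random

def crossover(c1, c2):
    """Algorithm 2 sketch: exchange l4 gene positions between two parents."""
    c1, c2 = c1[:], c2[:]                 # work on copies of the parents
    l3 = min(len(c1), len(c2))
    l4 = random.randint(1, l3)            # step (1): number of positions to swap
    loc1 = random.sample(range(len(c1)), l4)  # step (2): positions in C1
    loc2 = random.sample(range(len(c2)), l4)  # step (3): positions in C2
    for i, j in zip(loc1, loc2):          # step (4): pairwise exchange
        c1[i], c2[j] = c2[j], c1[i]
    return c1, c2
```

Note that the exchange preserves each offspring's length, so variable-length chromosomes keep their sizes through crossover.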
5.3. Mutation Operator
Similarly, we also need to design a new mutation operator to fit variable-length coding. Given the coding of one parent C3, let its length be l5. The number of genes mutated by the mutation operator should be less than or equal to l5. This study presents a new mutation operator, as shown in Algorithm 3.
Algorithm 3: Mutation operator.
(1) Generate one random integer l6 that is less than or equal to l5.
(2) For C3, generate the location to mutate loc3, which contains l6 different random integers that are larger than or equal to 1 and less than or equal to l5.
(3) The coding bits at the locations loc3 in C3 are replaced with other integers in [1, M] that are not equal to any of the existing coding bits of C3.
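Algorithm 3 can be sketched as follows; when fewer than l6 unused gene numbers exist, some selected positions are left unchanged, which is an assumed convention not specified in the text:

```python
import random

def mutate(c3, n_genes):
    """Algorithm 3 sketch: replace l6 positions with gene numbers not already in C3."""
    c3 = c3[:]                                   # work on a copy
    l6 = random.randint(1, len(c3))              # step (1): number of positions
    loc3 = random.sample(range(len(c3)), l6)     # step (2): positions to mutate
    # Step (3): candidate replacements are gene numbers absent from C3.
    unused = [g for g in range(1, n_genes + 1) if g not in set(c3)]
    for i, g in zip(loc3, random.sample(unused, min(l6, len(unused)))):
        c3[i] = g
    return c3
```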
5.4. Detailed Steps of IGRDCGA
To illustrate the main process of the proposed algorithm, Figure 1 shows the flow chart of the IGRDCGA.
[figure omitted; refer to PDF]
The detailed steps of the IGRDCGA are summarized as follows.
Step (1). Compute information gain ratio IGR of each gene in terms of formula (5) first. Then, eliminate those genes whose IGR is 0.
Step (2). To eliminate irrelevant genes and choose high quality genes, the threshold method is used to choose those genes whose IGR is the top best
Step (3). Implement Algorithm 1 described in Section 5.1 to obtain the initial population P1.
Step (4). Compute the fitness and obtain the best chromosome.
(1) For each chromosome pi in P1, select those genes determined by pi to obtain a new data
(2) Obtain the best chromosome
Step (5). Let the number of iterations
Step (6). Tournament selection strategy [55, 56] is used to choose
Step (7). Generate a random r. If r is less than or equal to the crossover probability
Step (8). Generate a random r. If r is less than or equal to the mutation probability
Step (9). Compute the fitness value of each chromosome in P4 and obtain the best chromosome
Step (10). Turn to Step 5.
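Putting Steps 3–10 together, the overall evolutionary loop can be outlined as below. This is a schematic sketch, not the authors' implementation: the fitness function (in the paper, a clustering-quality score on the data restricted to the selected genes), the population size, the iteration limit, and the probabilities p_c and p_m are all assumed placeholders:

```python
import random

def run_ga(fitness, n_genes, n_pop=20, t_max=50, p_c=0.9, p_m=0.1):
    """Schematic main loop of Steps 3-10: evolve variable-length gene subsets."""
    # Step 3: initial population of variable-length chromosomes.
    pop = [random.sample(range(1, n_genes + 1), random.randint(1, n_genes))
           for _ in range(n_pop)]
    # Step 4: best chromosome of the initial population.
    best = max(pop, key=fitness)
    for _ in range(t_max):                       # Step 5: iteration counter.
        new_pop = []
        while len(new_pop) < n_pop:
            # Step 6: tournament selection of two parents.
            p1 = max(random.sample(pop, 2), key=fitness)
            p2 = max(random.sample(pop, 2), key=fitness)
            c1, c2 = p1[:], p2[:]
            # Step 7: crossover with probability p_c (exchange l4 positions).
            if random.random() <= p_c:
                l4 = random.randint(1, min(len(c1), len(c2)))
                for i, j in zip(random.sample(range(len(c1)), l4),
                                random.sample(range(len(c2)), l4)):
                    c1[i], c2[j] = c2[j], c1[i]
            # Step 8: mutation with probability p_m (replace one position).
            for c in (c1, c2):
                if random.random() <= p_m:
                    unused = set(range(1, n_genes + 1)) - set(c)
                    if unused:
                        c[random.randrange(len(c))] = random.choice(sorted(unused))
            new_pop.extend([c1, c2])
        pop = new_pop[:n_pop]
        # Step 9: keep track of the best chromosome found so far.
        best = max(pop + [best], key=fitness)
    return best
```

A toy fitness such as `lambda c: -abs(len(c) - 3)` can be used to exercise the loop; the real objective would score the clustering of the cells on the selected genes.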
5.5. Time Complexity Analysis of IGRDCGA
Suppose that a scRNA-seq dataset contains N cells and M genes, and the population size in GA is Npop. The time complexity of the IGRDCGA is analyzed as follows.
The time complexity of Step 1 is O (M), the time complexity of Step 2 is O (1), and the time complexity of Algorithm 1 in Step 3 is O (Npop). In Step 4, the time complexity of the k means clustering is O (N), and that of obtaining the best chromosome is also O (Npop). Step 5 to Step 10 are the main iteration process of the IGRDCGA; their time complexities determine the time complexity of the IGRDCGA. The execution count of Steps 5 and 6 is both tmax (the maximal number of iterations), and the time complexities are both O (tmax). In Step 7, the time complexity of selecting two chromosomes is O (tmax), while the counts of executions of the crossover operator are l3 (the minimum value of the lengths of two chromosome codes), and its time complexity is still O (l3
To sum up, the time complexity of the IGRDCGA involves the following time complexities: O (1), O (M), O (Npop), O (tmax), O (l3
6. Numerical Results
In order to evaluate the performance of the IGRDCGA, two frequently used clustering algorithms, k-means and spectral clustering [57], and a state-of-the-art single-cell classification algorithm, SIMLR [58], are employed for comparison. To assess the gene selection performance of the IGRDCGA, two frequently used dimensionality reduction algorithms, PCA and tSNE, are also included in the comparison. For the SIMLR, we use its MATLAB program, available at https://github.com/BatzoglouLabSU/SIMLR. For k-means, PCA, and tSNE, the study utilizes the built-in functions in MATLAB.
6.1. Parameter Values
The related parameters of the IGRDCGA are described as follows.
(i) Parameter for IGR. The threshold
(ii) Terminal Condition for the IGRDCGA. The population size
The other competing algorithms use their default parameter values.
6.2. Results
6.2.1. Comparisons of the IGR
The IGR of each gene shows its relevance. The IGR ranges from 0 to 1, where 0 represents that the gene has no relevance for the classification. Therefore, the IGR can be employed to eliminate irrelevant genes roughly. We compute the IGR of each gene for all scRNA-seq datasets and obtain the number of irrelevant genes (#irrelevant genes) and their percentages with respect to the total genes. The obtained results are shown in Table 2.
Table 2
Irrelevant genes and their percentage.
Dataset | #Of genes | #Irrelevant genes | Percentage (%) |
Allodiploid | 2406 | 0 | 0 |
Sasagawa | 32700 | 15412 | 47.13 |
Ramskold | 21042 | 1005 | 4.78 |
Biase | 25734 | 51 | 0.2 |
Ting | 21583 | 4349 | 20.15 |
Chung | 41821 | 14217 | 33.99 |
Yeo | 27473 | 4042 | 14.71 |
Ning | 19080 | 1962 | 10.28 |
Ginhoux | 11834 | 0 | 0 |
Su | 15333 | 0 | 0 |
Usoskin | 25334 | 5800 | 22.89 |
Deng | 22431 | 1257 | 5.6 |
Fan | 26357 | 941 | 3.57 |
Goolam | 40315 | 12592 | 31.23 |
Kolodz | 13473 | 0 | 0 |
Nestorowa | 4773 | 0 | 0 |
Treutlein | 23271 | 10497 | 45.11 |
Camp15 | 18927 | 2135 | 11.28 |
Camp17 | 19020 | 1285 | 6.76 |
Yan | 20214 | 619 | 3.06 |
Wang | 19950 | 0 | 0 |
Li1 | 55186 | 12131 | 21.98 |
Li2 | 180253 | 40335 | 22.38 |
Patel | 5948 | 0 | 0 |
Pollen | 23730 | 1920 | 8.09 |
Manno_m | 24378 | 2390 | 9.8 |
Manno_h | 20560 | 0 | 0 |
Tasic | 24057 | 2443 | 10.16 |
Zeisel | 19486 | 0 | 0 |
Grun | 23536 | 7928 | 33.68 |
Baron | 14878 | 17 | 0.11 |
Muraro | 19127 | 225 | 1.18 |
Xin | 39851 | 5962 | 14.96 |
From Table 2, we can clearly see that irrelevant genes exist in most datasets. The percentage of irrelevant genes with respect to the total genes indicates the gene redundancy rate. There are 24 datasets in Table 2 whose gene redundancy rate is larger than 0, which account for 72.72% of the total 33 datasets. This demonstrates that the criterion IGR = 0 is a good way to eliminate irrelevant genes.
However, Table 2 also shows that there are 9 datasets whose redundancy rate is 0. Namely, the criterion IGR = 0 cannot find any irrelevant genes in these 9 datasets. This also demonstrates that IGR = 0 can identify only some of the irrelevant genes and cannot determine all of them. Therefore, our proposed algorithm IGRDCGA utilizes a threshold method to eliminate irrelevant genes.
Comparatively speaking, datasets containing more genes tend to possess higher gene redundancy rates, and vice versa. Nevertheless, this is not absolute, as can be seen in Table 2.
6.2.2. Comparisons of Evaluation Metrics
We perform the IGRDCGA and several competing algorithms on the above scRNA-seq datasets described in Section 4.1. Three evaluation metrics NMI, ARI, and purity are employed to evaluate the performances of the IGRDCGA and the other four competing algorithms. All the algorithms are independently performed for 20 runs to obtain their average values. For all 33 datasets, the average values of NMI, ARI, and purity metrics are shown in Tables 3–5, respectively.
Table 3
NMI metric obtained by the IGRDCGA and several competing algorithms.
Dataset | K means | Spectral | SIMLR | IGRDCGA | PCA | tSNE |
Allodiploid | 0.3793 | 0.0778 | 1.0000 | 1.0000 | 1.0000 | 0.0490 |
Sasagawa | 0.3048 | 0.1815 | 0.4665 | 0.9128 | 0.6349 | 0.0903 |
Ramskold | 0.6868 | 0.4946 | 0.7594 | 0.9709 | 0.9394 | 0.3444 |
Biase | 0.6690 | 0.5114 | 0.8321 | 1.0000 | 0.9501 | 0.0544 |
Ting | 0.2161 | 0.1211 | 0.5082 | 0.5778 | 0.5746 | 0.0585 |
Chung | 0.1280 | 0.0295 | 0.3170 | 0.4196 | 0.2278 | 0.3336 |
Yeo | 0.2358 | 0.0480 | 0.7744 | 0.8168 | 0.7650 | 0.6831 |
Ning | 0.1100 | 0.0388 | 0.3465 | 0.7432 | 0.3579 | 0.0073 |
Ginhoux | 0.4094 | 0.0326 | 0.4189 | 0.5580 | 0.4777 | 0.0074 |
Su | 0.3145 | 0.0676 | 0.4963 | 0.5951 | 0.3910 | 0.0201 |
Usoskin | 0.1823 | 0.0363 | 0.4552 | 0.8334 | 0.6235 | 0.6579 |
Deng | 0.7451 | 0.3860 | 0.6979 | 0.8977 | 0.7671 | 0.0334 |
Fan | 0.2939 | 0.1910 | 0.4925 | 0.5874 | 0.5249 | 0.1417 |
Goolam | 0.2826 | 0.1917 | 0.4464 | 0.7428 | 0.3282 | 0.0527 |
Kolodz | 0.7024 | 0.0235 | 0.9634 | 1.0000 | 0.9790 | 0.0035 |
Nestorowa | 0.3499 | 0.1983 | 0.3514 | 0.3774 | 0.3562 | 0.3795 |
Treutlein | 0.2520 | 0.2585 | 0.4574 | 0.8074 | 0.5445 | 0.0754 |
Camp15 | 0.4025 | 0.0596 | 0.5098 | 0.4945 | 0.4257 | 0.4979 |
Camp17 | 0.6479 | 0.0735 | 0.7420 | 0.7344 | 0.7366 | 0.0117 |
Yan | 0.7034 | 0.4988 | 0.7937 | 0.9480 | 0.8830 | 0.0872 |
Wang | 0.0609 | 0.0613 | 0.2588 | 0.3038 | 0.2226 | 0.3274 |
Li1 | 0.4852 | 0.0634 | 0.7377 | 0.7623 | 0.7168 | 0.0275 |
Li2 | 0.2117 | 0.1807 | 0.3387 | 0.5915 | 0.3452 | 0.1425 |
Patel | 0.1669 | 0.0429 | 0.6446 | 0.7436 | 0.7349 | 0.8248 |
Pollen | 0.5846 | 0.1790 | 0.9272 | 0.9008 | 0.9055 | 0.8886 |
Manno_m | 0.3084 | 0.0887 | 0.4026 | 0.3979 | 0.4065 | 0.4710 |
Manno_h | 0.3796 | 0.6052 | 0.4867 | 0.4849 | 0.5287 | 0.5927 |
Tasic | 0.6758 | 0.7707 | 0.7456 | 0.8458 | 0.8039 | 0.8180 |
Zeisel | 0.4681 | 0.6586 | 0.7097 | 0.6076 | 0.5229 | 0.7391 |
Grun | 0.0059 | 0.1145 | 0.4133 | 0.3729 | 0.0055 | 0.0005 |
Baron | 0.2079 | 0.5731 | 0.5950 | 0.5480 | 0.4281 | 0.6618 |
Muraro | 0.2930 | 0.6647 | 0.7001 | 0.6897 | 0.4189 | 0.0095 |
Xin | 0.2587 | 0.4237 | 0.2422 | 0.7201 | 0.5445 | 0.0100 |
Table 4
ARI metric obtained by the IGRDCGA and several competing algorithms.
Dataset | K means | Spectral | SIMLR | IGRDCGA | PCA | tSNE |
Allodiploid | 0.2767 | −0.0500 | 1.0000 | 1.0000 | 1.0000 | −0.0125 |
Sasagawa | 0.1322 | 0.0262 | 0.4306 | 0.9023 | 0.5238 | −0.0037 |
Ramskold | 0.3999 | 0.1379 | 0.5358 | 0.9481 | 0.8590 | 0.0024 |
Biase | 0.5343 | 0.2827 | 0.7346 | 1.0000 | 0.9556 | −0.0118 |
Ting | 0.0985 | 0.0196 | 0.3019 | 0.3905 | 0.3044 | 0.0002 |
Chung | 0.0420 | −0.0021 | 0.2011 | 0.2059 | 0.1222 | 0.2216 |
Yeo | 0.1408 | −0.0068 | 0.7760 | 0.7964 | 0.7675 | 0.5966 |
Ning | 0.0608 | 0.0007 | 0.3269 | 0.7456 | 0.3137 | −0.0004 |
Ginhoux | 0.3753 | −0.0016 | 0.3379 | 0.6062 | 0.4790 | −0.0002 |
Su | 0.3035 | 0.0084 | 0.3329 | 0.5590 | 0.3180 | 0.0000 |
Usoskin | 0.1033 | 0.0020 | 0.3007 | 0.8343 | 0.5319 | 0.5538 |
Deng | 0.7033 | 0.1435 | 0.5059 | 0.8786 | 0.5460 | 0.0011 |
Fan | 0.1156 | 0.0386 | 0.3323 | 0.4049 | 0.3529 | 0.0060 |
Goolam | 0.1436 | 0.0656 | 0.2284 | 0.7564 | 0.1433 | 0.0005 |
Kolodz | 0.6468 | 0.0010 | 0.9762 | 1.0000 | 0.9880 | 0.0011 |
Nestorowa | 0.2236 | 0.0735 | 0.2166 | 0.2572 | 0.2284 | 0.2584 |
Treutlein | 0.1204 | 0.0741 | 0.2615 | 0.8138 | 0.4066 | −0.0038 |
Camp15 | 0.3614 | 0.0141 | 0.4402 | 0.4835 | 0.3537 | 0.4121 |
Camp17 | 0.4667 | 0.0031 | 0.4859 | 0.6025 | 0.5641 | −0.0005 |
Yan | 0.5801 | 0.2894 | 0.6304 | 0.9425 | 0.8453 | −0.0045 |
Wang | 0.0084 | 0.0025 | 0.1445 | 0.1239 | 0.1111 | 0.1926 |
Li1 | 0.2265 | 0.0002 | 0.6032 | 0.6023 | 0.5354 | −0.0002 |
Li2 | 0.0280 | 0.0111 | 0.1496 | 0.3423 | 0.1793 | 0.0014 |
Patel | 0.0581 | −0.0012 | 0.6065 | 0.7383 | 0.7424 | 0.8437 |
Pollen | 0.2423 | 0.0177 | 0.8436 | 0.8258 | 0.8326 | 0.7889 |
Manno_m | 0.0721 | 0.0024 | 0.1317 | 0.1351 | 0.1328 | 0.1822 |
Manno_h | 0.0774 | 0.3050 | 0.1489 | 0.1593 | 0.2006 | 0.2767 |
Tasic | 0.4326 | 0.6113 | 0.6176 | 0.8090 | 0.6889 | 0.6716 |
Zeisel | 0.3023 | 0.4970 | 0.5730 | 0.4676 | 0.3418 | 0.5866 |
Grun | −0.0300 | 0.0730 | 0.4375 | 0.5053 | −0.0355 | −0.0001 |
Baron | 0.0516 | 0.2526 | 0.2887 | 0.3202 | 0.1450 | 0.3251 |
Muraro | 0.1306 | 0.4838 | 0.5716 | 0.6391 | 0.2261 | −0.0002 |
Xin | 0.2802 | 0.2389 | 0.0771 | 0.8145 | 0.3374 | 0.0000 |
Table 5
Purity metric obtained by the IGRDCGA and several competing algorithms.
Dataset | K means | spectral | SIMLR | IGRDCGA | PCA | tSNE |
Allodiploid | 0.2767 | −0.0500 | 1.0000 | 1.0000 | 1.0000 | −0.0125 |
Sasagawa | 0.0387 | −0.2253 | 0.5020 | 0.9170 | 0.5731 | 0.1462 |
Ramskold | 0.5953 | 0.1801 | 0.7386 | 0.9756 | 0.9331 | 0.5491 |
Biase | 0.5124 | 0.1796 | 0.7896 | 1.0000 | 0.9636 | 0.2149 |
Ting | −0.1852 | −0.4401 | 0.5643 | 0.4239 | 0.5204 | 0.3951 |
Chung | −0.1686 | −0.4338 | 0.3331 | 0.1288 | 0.2860 | 0.3774 |
Yeo | −0.0742 | −0.4393 | 0.8247 | 0.8352 | 0.8196 | 0.6856 |
Ning | −0.1955 | −0.3228 | 0.3667 | 0.7629 | 0.3466 | 0.1128 |
Ginhoux | 0.3983 | −0.3019 | 0.3912 | 0.6425 | 0.5308 | 0.1045 |
Su | 0.4949 | −0.4755 | 0.5004 | 0.7007 | 0.5594 | 0.3739 |
Usoskin | −0.1365 | −0.4286 | 0.3846 | 0.8584 | 0.6343 | 0.6506 |
Deng | 0.7271 | −0.0730 | 0.6137 | 0.8920 | 0.6378 | 0.2573 |
Fan | 0.1441 | −0.2779 | 0.5980 | 0.6421 | 0.6073 | 0.4231 |
Goolam | 0.1340 | −0.1726 | 0.3632 | 0.7725 | 0.2284 | 0.1841 |
Kolodz | 0.5854 | −0.2917 | 0.9783 | 1.0000 | 0.9891 | 0.1003 |
Nestorowa | 0.5335 | −0.0128 | 0.6124 | 0.6035 | 0.6294 | 0.6444 |
Treutlein | −0.0020 | −0.1692 | 0.4146 | 0.8372 | 0.5156 | 0.2137 |
Camp15 | 0.4989 | −0.4715 | 0.6333 | 0.6594 | 0.5855 | 0.6197 |
Camp17 | 0.6779 | −0.6129 | 0.6976 | 0.7836 | 0.7586 | 0.4854 |
Yan | 0.6861 | 0.3208 | 0.7748 | 0.9620 | 0.9001 | 0.3921 |
Wang | −0.3533 | −0.5255 | 0.4835 | 0.2183 | 0.4099 | 0.5145 |
Li1 | 0.2909 | −0.7039 | 0.8271 | 0.8050 | 0.7802 | 0.5820 |
Li2 | −0.2315 | 0.4114 | 0.4866 | 0.4227 | 0.3781 | 0.4231 |
Patel | −0.3625 | −0.5657 | 0.7379 | 0.8244 | 0.8327 | 0.8982 |
Pollen | 0.2559 | −0.5278 | 0.9414 | 0.9271 | 0.9369 | 0.9210 |
Manno_m | 0.4832 | −0.8098 | 0.8236 | 0.7025 | 0.8149 | 0.8482 |
Manno_h | 0.6085 | 0.9307 | 0.8803 | 0.8370 | 0.9099 | 0.9268 |
Tasic | 0.6178 | 0.8780 | 0.8747 | 0.9231 | 0.8948 | 0.8939 |
Zeisel | 0.5849 | 0.7180 | 0.7558 | 0.6948 | 0.6250 | 0.7646 |
Grun | 0.2741 | 0.0728 | 0.4554 | 0.6157 | 0.2331 | −0.0003 |
Baron | 0.1026 | 0.5293 | 0.5466 | 0.5121 | 0.4274 | 0.5723 |
Muraro | 0.2705 | 0.7018 | 0.7430 | 0.7552 | 0.4868 | 0.4388 |
Xin | 0.1638 | 0.3518 | 0.2060 | 0.8166 | 0.4312 | 0.1518 |
From Table 3, we can clearly see that, for 22 of 33 datasets, the IGRDCGA gains the largest NMI among the six algorithms, which accounts for 66.67% (22 of 33 datasets). Meanwhile, for 6 of the remaining datasets, the differences between the largest NMI and the NMI obtained by the IGRDCGA are very small. For k-means, spectral clustering, SIMLR, PCA, and tSNE, the numbers of datasets on which they obtain the largest NMI are, respectively, 0, 1, 6, 2, and 6. By comparison, the IGRDCGA outperforms the other five competing algorithms in terms of NMI.
Table 4 shows that, for 24 of 33 datasets, the IGRDCGA acquires the maximal ARI among the six algorithms, which accounts for 72.72%. For k-means, spectral clustering, SIMLR, PCA, and tSNE, the numbers of datasets on which they obtain the maximal ARI are, respectively, 0, 0, 3, 1, and 7. By contrast, the IGRDCGA is superior to the other five competing algorithms in terms of ARI.
From Table 5, it can be clearly seen that, for 21 of 33 datasets, the purity obtained by the IGRDCGA is the largest among the six algorithms, which accounts for 63.64%. For k-means, spectral clustering, SIMLR, PCA, and tSNE, the numbers of datasets on which they obtain the largest purity are, respectively, 0, 1, 6, 1, and 6. By comparison, the IGRDCGA outperforms the other five competing algorithms in terms of purity.
By comparing Tables 3–5, we summarize the best evaluation metrics obtained by six algorithms in Table 6, where the NMI, ARI, and purity metrics are, respectively, denoted by 1, 2, and 3.
Table 6
The best evaluation metrics obtained by six algorithms.
Dataset | K means | spectral | SIMLR | IGRDCGA | PCA | tSNE |
Allodiploid | 1, 2, 3 | 1, 2, 3 | 1, 2, 3 | |||
Sasagawa | 1, 2, 3 | |||||
Ramskold | 1, 2, 3 | |||||
Biase | 1, 2, 3 | |||||
Ting | 3 | 1, 2 | ||||
Chung | 3 | 1, 2 | ||||
Yeo | 1, 2, 3 | |||||
Ning | 1, 2, 3 | |||||
Ginhoux | 1, 2, 3 | |||||
Su | 1, 2, 3 | |||||
Usoskin | 1, 2, 3 | |||||
Deng | 1, 2, 3 | |||||
Fan | 1, 2, 3 | |||||
Goolam | 1, 2, 3 | |||||
Kolodz | 1, 2, 3 | |||||
Nestorowa | 1, 2, 3 | |||||
Treutlein | 1, 2, 3 | |||||
Camp15 | 1 | 2, 3 | ||||
Camp17 | 1 | 2, 3 | ||||
Yan | 1, 2, 3 | |||||
Wang | 1, 2, 3 | |||||
Li1 | 2, 3 | 1 | ||||
Li2 | 3 | 1, 2 | ||||
Patel | 1, 2, 3 | |||||
Pollen | 1, 2, 3 | |||||
Manno_m | 1, 2, 3 | |||||
Manno_h | 1, 3 | 2 | ||||
Tasic | 1, 2, 3 | |||||
Zeisel | 1, 2, 3 | |||||
Grun | 1 | 2, 3 | ||||
Baron | 1, 2, 3 | |||||
Muraro | 1 | 2, 3 | ||||
Xin | 1, 2, 3 |
From Table 6, we can clearly see that, for 15 of 33 datasets, the NMI, ARI, and purity metrics obtained by the IGRDCGA are all the best among the six algorithms, which accounts for 45.45%. For k-means, spectral clustering, SIMLR, PCA, and tSNE, the numbers of datasets on which they obtain the best NMI, ARI, and purity metrics are, respectively, 0, 0, 2, 1, and 6. Thus, the IGRDCGA outperforms the other five competing algorithms in terms of NMI, ARI, and purity.
In the meantime, Table 6 shows that the IGRDCGA gains the best NMI, ARI, and purity metrics only on part of the datasets. For the Allodiploid, the IGRDCGA, SIMLR, and PCA all obtain the best NMI, ARI, and purity metrics. We can observe from Table 1 that the Allodiploid possesses far fewer cells and genes than the other datasets; that is, for datasets with smaller dimensions, it is easier for an algorithm to obtain the best NMI, ARI, and purity metrics. In this case, the PCA is the most suitable method, as it is the simplest and easiest to implement of the three algorithms. For the Ting, Chung, and Li2, the IGRDCGA obtains the best NMI and ARI, while the SIMLR obtains the best purity. For the Camp15, Camp17, Grun, and Muraro, the IGRDCGA obtains the best ARI and purity, while the SIMLR obtains the best NMI. By comparing the IGRDCGA with the SIMLR, we can clearly see that the number of best metrics obtained by the IGRDCGA is far larger than that obtained by the SIMLR. This fully demonstrates that the IGRDCGA is superior to the SIMLR in terms of NMI, ARI, and purity.
For the Nestorowa, Wang, Patel, Manno_m, Zeisel, and Baron datasets, the tSNE gains the best NMI, ARI, and purity metrics. However, for the other datasets, the three evaluation metrics obtained by the tSNE are far worse than those obtained by the IGRDCGA. As shown in Table 3, there are 17 datasets for which the NMI of the tSNE is less than 0.1, while there are no datasets for which the NMI of the IGRDCGA is less than 0.1. In Table 4, there are 20 datasets for which the ARI of the tSNE is less than 0.1, while there are none for the IGRDCGA. In Table 5, there are 19 datasets for which the purity of the tSNE is less than 0.5, while there are only 4 such datasets for the IGRDCGA. Namely, for most datasets except the above six, the NMI, ARI, and purity metrics obtained by the IGRDCGA are superior to those obtained by the tSNE. This attests that the IGRDCGA outperforms the tSNE for most datasets in terms of NMI, ARI, and purity.
A closer analysis of these six datasets against the remaining ones reveals the following principal features and differences. First, all six datasets have very low gene redundancy rates. Table 2 shows that the redundancy rates of the Zeisel and Baron are 9.8 and 0.11, respectively, while those of the other four datasets are all 0. On datasets with low redundancy rates, the IGRDCGA may discard informative genes, which degrades classification performance. Second, the six datasets yield relatively low NMI and ARI for all six algorithms. From Table 3, all NMI metrics of the Nestorowa, Wang, and Manno_m are below 0.38, 0.33, and 0.48, respectively. From Table 4, all ARI metrics of the Nestorowa, Wang, Manno_m, and Baron are below 0.26, 0.20, 0.19, and 0.33, respectively. Third, the six datasets are almost completely sparse. As shown in our appendix file, data values within [0, 0.1] are more frequent than those in any other interval for all datasets. The situation is far more pronounced for these six datasets: if the values within [0, 0.1] are set to 0, they become almost completely sparse. For the Nestorowa, Wang, Patel, Manno_m, Zeisel, and Baron, the values within [0, 0.1] account for more than 88%, 95%, 73%, 91%, 80%, and 91%, respectively, proportions so high that they nearly swamp the values in all other ranges.
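The near-zero fraction discussed above can be measured with a few lines of NumPy. This is an illustrative sketch on a synthetic matrix, not the paper's code; the matrix shape and the 90% zeroing rate are arbitrary assumptions chosen to mimic a sparse expression matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 500))            # hypothetical cells-by-genes matrix
X[rng.random(X.shape) < 0.9] = 0.0    # force roughly 90% of entries to zero

# Fraction of expression values falling in [0, 0.1], as tabulated
# in the appendix for each dataset.
near_zero = np.mean((X >= 0.0) & (X <= 0.1))
print(f"fraction of values in [0, 0.1]: {near_zero:.2f}")
```

A fraction close to 0.9 or above, as for the six datasets named here, indicates that near-zero values dominate every other interval of the data.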
6.2.3. Iteration Plots and Heat Maps
To illustrate the performance of the IGRDCGA clearly, three scRNA-seq datasets, Allodiploid, Yeo, and Grun, are selected as representative datasets according to their different NMI values in Table 3, and their iteration plots are shown in Figures 2–4, respectively. For the Allodiploid, the NMI obtained by the IGRDCGA is 1, which means the predicted labels are fully consistent with the ground truth class labels. As Figure 2 clearly shows, the iteration plot of the Allodiploid is parallel to the X-axis at the maximal NMI of 1; that is, the IGRDCGA reaches the maximal NMI in the first iteration, and the later iterations bring no further improvement. Among the 33 scRNA-seq datasets above, only the Allodiploid exhibits the plot shown in Figure 2.
[figure omitted; refer to PDF]
In Figure 3, the iteration plot of the Yeo rises as the number of iterations increases. The ascending trend is pronounced in the early iterations and then levels off; that is, the NMI plateaus at a fixed value in the later iterations. Among the 33 scRNA-seq datasets above, most display a plot similar to that in Figure 3.
In Figure 4, the iteration plot of the Grun keeps rising as the number of iterations increases, and the ascending trend remains pronounced throughout the whole run. Among the 33 scRNA-seq datasets above, a few display a plot similar to that in Figure 4.
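An iteration plot like those in Figures 2–4 is produced by recording the best fitness (here, the NMI of the best individual) in each generation. The sketch below stubs the DCGA with a hypothetical `evolve_one_generation` function, since only the recording pattern is being illustrated, not the paper's actual genetic operators.

```python
import random

random.seed(42)

def evolve_one_generation(best_so_far):
    # Stand-in for one DCGA generation. With elitism, the best individual
    # is never lost, so the best fitness can only stay flat or improve,
    # and NMI is capped at 1.0.
    return max(best_so_far, min(1.0, best_so_far + random.uniform(0.0, 0.05)))

history = []
best_nmi = 0.3                      # hypothetical initial best NMI
for generation in range(100):
    best_nmi = evolve_one_generation(best_nmi)
    history.append(best_nmi)

# `history` is monotonically non-decreasing: plotting it against the
# generation index gives an ascending curve as in Figures 3 and 4,
# while a flat line at 1.0 corresponds to the Figure 2 case, where the
# optimum is found in the first iteration.
```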
To illustrate the classification performance of the IGRDCGA more intuitively, the heat maps of the Allodiploid, Sasagawa, and Yan are shown in Figures 5–7, respectively.
[figure omitted; refer to PDF]
The Allodiploid contains 16 cells, 2406 genes, and 2 classes, whose ground truth class labels are shown in the first column of Figure 5. From Figure 5, the results obtained by the IGRDCGA, SIMLR, and PCA are fully concordant with the ground truth class labels; those obtained by k-means and spectral clustering collapse nearly into one class; and those obtained by the tSNE are oscillatory. Clearly, the results of k-means, spectral clustering, and the tSNE are far worse than those of the IGRDCGA, SIMLR, and PCA, and the results of the tSNE are the worst among the six algorithms.
The Sasagawa contains 23 cells, 32700 genes, and 3 classes, whose ground truth labels are shown in the first column of Figure 6. From Figure 6, the results obtained by the IGRDCGA are fully concordant with the ground truth class labels; those obtained by spectral clustering collapse nearly into one class; and those obtained by the tSNE are oscillatory and the worst. In the results obtained by k-means, the SIMLR, and PCA, the cells of subclass 2 (ground truth label 2) are grouped in full concordance with the ground truth, while the other subclasses are confused. Therefore, for the Sasagawa, the IGRDCGA yields the best results and the tSNE the worst.
The first column of Figure 7 shows the ground truth labels of the Yan, which contains 90 cells, 20214 genes, and 6 classes. From Figure 7, none of the six algorithms produces results fully concordant with the ground truth class labels. However, the results of the IGRDCGA are closest to the ground truth, and its subclasses 2, 3, 4, 5, and 6 are fully concordant with the ground truth labels. The results of the tSNE are oscillatory and the worst among the six algorithms, and those of the remaining algorithms are all confused. Therefore, for the Yan, the IGRDCGA obtains the best results whereas the tSNE obtains the worst.
7. Conclusion and Future Work
In this study, we present a novel algorithm, abbreviated as IGRDCGA, to address gene selection and classification for scRNA-seq data. It combines the information gain ratio (IGR) and the genetic algorithm with dynamic crossover (DCGA): the IGR eliminates irrelevant genes roughly, and the DCGA then chooses high quality genes finely. We evaluated the IGRDCGA and several competing algorithms on 33 publicly available scRNA-seq datasets. The results demonstrate that the IGRDCGA can eliminate irrelevant genes and choose high quality genes effectively, and that it is superior to the competing algorithms in terms of classification accuracy.
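The rough-filtering stage recapped above can be sketched as follows. This is an illustrative implementation of an information-gain-ratio prefilter under simplifying assumptions (binary discretization at each gene's median, a synthetic two-class dataset); it is not the authors' code, and the DCGA refinement stage is omitted.

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a discrete label vector, in bits.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(feature, labels):
    # Information gain of a median-thresholded gene, divided by the
    # split (intrinsic) entropy to penalize uninformative partitions.
    bins = (feature > np.median(feature)).astype(int)
    h_y = entropy(labels)
    h_y_given = 0.0
    for b in np.unique(bins):
        mask = bins == b
        h_y_given += mask.mean() * entropy(labels[mask])
    split_info = entropy(bins)
    return (h_y - h_y_given) / split_info if split_info > 0 else 0.0

rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 50)             # hypothetical cell classes
X = rng.random((100, 200))                 # hypothetical expression matrix
X[:50, 0] += 1.0                           # gene 0 cleanly separates classes

scores = np.array([gain_ratio(X[:, g], labels) for g in range(X.shape[1])])
top_genes = np.argsort(scores)[::-1][:20]  # preliminary gene subset
```

The class-separating gene receives a gain ratio of 1 and lands in the preliminary subset, while the random genes score near 0; in the full method, this subset would then be refined by the DCGA.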
The algorithm leaves room for further enhancement and improvement. One direction is to adopt a more efficient encoding to improve its convergence rate and stability. Another is to extend the IGRDCGA to classification of other high-dimensional problems.
Acknowledgments
This research was supported by Science Research Foundation for High-level Talents of Yulin Normal University (no. G2021ZK17).
[1] Y. Wang, H. Liu, F. Wei, T. Zong, X. Li, "Cooperative coevolution with formula-based variable grouping for large-scale global optimization," Evolutionary Computation, vol. 26 no. 4, pp. 569-596, DOI: 10.1162/evco_a_00214, 2018.
[2] B. Nakisa, M. N. Rastgoo, D. Tjondronegoro, V. Chandran, "Evolutionary computation algorithms for feature selection of EEG-based emotion recognition using mobile sensors," Expert Systems with Applications, vol. 93, pp. 143-155, DOI: 10.1016/j.eswa.2017.09.062, 2018.
[3] D. Y. Eroglu, K. Kilic, "A novel hybrid genetic local search algorithm for feature selection and weighting with an application in strategic decision making in innovation management," Information Sciences, vol. 405, pp. 18-32, DOI: 10.1016/j.ins.2017.04.009, 2017.
[4] N. Maleki, Y. Zeinali, "A k-NN method for lung cancer prognosis with the use of a genetic algorithm for feature selection," Expert Systems with Applications, vol. 164, 2020.
[5] M. Tahir, A. Tubaishat, F. Al-Obeidat, "A novel binary chaotic genetic algorithm for feature selection and its utility in affective computing and healthcare," Neural Computing and Applications,DOI: 10.1007/s00521-020-05347-y, 2020.
[6] A. H. Mohammad, "Comparing two feature selections methods (information gain and gain ratio) on three different classification algorithms using Arabic dataset," Journal of Theoretical & Applied Information Technology, vol. 96 no. 6, pp. 1561-1569, 2018.
[7] H. Uğuz, "A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm," Knowledge-Based Systems, vol. 24 no. 7, pp. 1024-1032, 2011.
[8] A. Chinnaswamy, R. Srinivasan, "Hybrid information gain based fuzzy roughset feature selection in cancer microarray data," Proceedings of the International Conference on Innovations in Power and Advanced Computing Technologies (I-PACT), .
[9] C. Dai, Y. Wang, M. Ye, X. Xue, H. Liu, "An orthogonal evolutionary algorithm with learning automata for multiobjective optimization," IEEE Transactions on Cybernetics, vol. 46 no. 12, pp. 3306-3319, DOI: 10.1109/TCYB.2015.2503433, 2016.
[10] X. Xue, Y. Wang, "Using memetic algorithm for instance coreference resolution," IEEE Transactions on Knowledge and Data Engineering, vol. 28 no. 2, pp. 580-591, DOI: 10.1109/tkde.2015.2475755, 2016.
[11] J. Liu, Y. Wang, N. Fan, S. Wei, W. Tong, "A convergence-diversity balanced fitness evaluation mechanism for decomposition-based many-objective optimization algorithm," Integrated Computer-Aided Engineering, vol. 26 no. 2, pp. 159-184, DOI: 10.3233/ica-180594, 2019.
[12] H. Liu, Y. Wang, L. Liu, X. Li, "A two phase hybrid algorithm with a new decomposition method for large scale optimization," Integrated Computer-Aided Engineering, vol. 25 no. 4, pp. 349-367, DOI: 10.3233/ica-170571, 2018.
[13] M. Ye, "A hybrid genetic algorithm for the minimum exposure path problem of wireless sensor networks based on a numerical functional extreme model," IEEE Transactions on Vehicular Technology, vol. 65 no. 10, pp. 8644-8657, 2015.
[14] X. Xue, Y. Wang, "Optimizing ontology alignments through a memetic algorithm using both MatchFmeasure and unanimous improvement ratio," Artificial Intelligence, vol. 223, pp. 65-81, DOI: 10.1016/j.artint.2015.03.001, 2015.
[15] C. Dai, Y. Wang, "A new decomposition based evolutionary algorithm with uniform designs for many-objective optimization," Applied Soft Computing, vol. 30 no. 1, pp. 238-248, 2015.
[16] Y.-M. Cheung, F. Gu, H.-L. Liu, "Objective extraction for many-objective optimization problems: algorithm and test problems," IEEE Transactions on Evolutionary Computation, vol. 20 no. 5, pp. 755-772, DOI: 10.1109/tevc.2016.2519758, 2016.
[17] P. A. Estévez, M. Tesmer, C. A. Perez, J. M. Zurada, "Normalized mutual information feature selection," IEEE Transactions on Neural Networks, vol. 20 no. 2, pp. 189-201, DOI: 10.1109/TNN.2008.2005601, 2009.
[18] A. F. McDaid, D. Greene, N. Hurley, "Normalized mutual information to evaluate overlapping community finding algorithms," 2011. https://arxiv.org/abs/1110.2515
[19] O. Abedinia, N. Amjady, H. Zareipour, "A new feature selection technique for load and price forecast of electrical power systems," IEEE Transactions on Power Systems, vol. 32 no. 1, pp. 62-74, DOI: 10.1109/tpwrs.2016.2556620, 2017.
[20] C. Xu, Z. Su, "Identification of cell types from single-cell transcriptomes using a novel clustering method," Bioinformatics, vol. 31 no. 12, pp. 1974-1980, DOI: 10.1093/bioinformatics/btv088, 2015.
[21] V. Y. Kiselev, K. Kirschner, M. T. Schaub, T. Andrews, A. Yiu, T. Chandra, K. N. Natarajan, W. Reik, M. Barahona, A. R. Green, M. Hemberg, "SC3: consensus clustering of single-cell RNA-seq data," Nature Methods, vol. 14 no. 5, pp. 483-486, DOI: 10.1038/nmeth.4236, 2017.
[22] S. Wagner, D. Wagner, "Comparing clusterings: an overview," 2007.
[23] X. Li, X. L. Cui, J. Q. Wang, Y. K. Wang, Y. F. Li, L. Y. Wang, "Generation and application of mouse-rat allodiploid embryonic stem cells," Cell, vol. 164 no. 1-2, pp. 279-292, DOI: 10.1016/j.cell.2015.11.035, 2016.
[24] Y. Sasagawa, I. Nikaido, T. Hayashi, H. Danno, K. D. Uno, T. Imai, H. R. Ueda, "Quartz-Seq: a highly reproducible and sensitive single-cell RNA sequencing method, reveals non-genetic gene-expression heterogeneity," Genome Biology, vol. 14 no. 4,DOI: 10.1186/gb-2013-14-4-r31, 2013.
[25] D. Ramsköld, S. Luo, Y. C. Wang, R. Li, Q. Deng, O. R. Faridani, G. A. Daniels, I. Khrebtukova, J. F. Loring, L. C. Laurent, G. P. Schroth, R. Sandberg, "Full-length mRNA-Seq from single-cell levels of RNA and individual circulating tumor cells," Nature Biotechnology, vol. 30 no. 8, pp. 777-782, DOI: 10.1038/nbt.2282, 2012.
[26] F. H. Biase, X. Cao, S. Zhong, "Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell RNA sequencing," Genome Research, vol. 24 no. 11, pp. 1787-1796, DOI: 10.1101/gr.177725.114, 2014.
[27] D. T. Ting, B. S. Wittner, M. Ligorio, N. Vincent Jordan, A. M. Shah, D. T. Miyamoto, N. Aceto, F. Bersani, B. W. Brannigan, K. Xega, J. C. Ciciliano, H. Zhu, O. C. MacKenzie, J. Trautwein, K. S. Arora, M. Shahid, H. L. Ellis, N. Qu, N. Bardeesy, M. N. Rivera, V. Deshpande, C. R. Ferrone, R. Kapur, S. Ramaswamy, T. Shioda, M. Toner, S. Maheswaran, D. A. Haber, "Single-cell RNA sequencing identifies extracellular matrix gene expression by pancreatic circulating tumor cells," Cell Reports, vol. 8 no. 6, pp. 1905-1918, DOI: 10.1016/j.celrep.2014.08.029, 2014.
[28] W. Chung, H. H. Eum, H.-O. Lee, K.-M. Lee, H.-B. Lee, K.-T. Kim, H. S. Ryu, S. Kim, J. E. Lee, Y. H. Park, Z. Kan, W. Han, W.-Y. Park, "Single-cell RNA-seq enables comprehensive tumour and immune cell profiling in primary breast cancer," Nature Communications, vol. 8 no. 1,DOI: 10.1038/ncomms15081, 2017.
[29] T. Yeo, S. J. Tan, C. L. Lim, D. P. X. Lau, Y. W. Chua, S. S. Krisna, G. Iyer, G. S. Tan, T. K. H. Lim, D. S. W. Tan, W.-T. Lim, C. T. Lim, "Microfluidic enrichment for the single cell analysis of circulating tumor cells," Scientific Reports, vol. 6 no. 1, pp. 22076-22087, DOI: 10.1038/srep22076, 2016.
[30] N. Leng, L. F. Chu, C. Barry, Y. Li, J. Choi, X. Li, P. Jiang, R. M. Stewart, J. A. Thomson, C. Kendziorski, "Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments," Nature Methods, vol. 12 no. 10, pp. 947-950, DOI: 10.1038/nmeth.3549, 2015.
[31] A. Schlitzer, V. Sivakamasundari, J. Chen, H. R. B. Sumatoh, J. Schreuder, J. Lum, B. Malleret, S. Zhang, A. Larbi, F. Zolezzi, L. Renia, M. Poidinger, S. Naik, E. W. Newell, P. Robson, F. Ginhoux, "Identification of cDC1- and cDC2-committed DC progenitors reveals early lineage priming at the common DC progenitor stage in the bone marrow," Nature Immunology, vol. 16 no. 7, pp. 718-728, DOI: 10.1038/ni.3200, 2015.
[32] X. Su, Y. Shi, X. Zou, Z.-N. Lu, G. Xie, J. Y. H. Yang, C.-C. Wu, X.-F. Cui, K.-Y. He, Q. Luo, Y.-L. Qu, N. Wang, L. Wang, Z.-G. Han, "Single-cell RNA-Seq analysis reveals dynamic trajectories during mouse liver development," BMC Genomics, vol. 18 no. 1,DOI: 10.1186/s12864-017-4342-x, 2017.
[33] D. Usoskin, A. Furlan, S. Islam, H. Abdo, P. Lönnerberg, D. Lou, J. Hjerling-Leffler, J. Haeggström, O. Kharchenko, P. V. Kharchenko, S. Linnarsson, P. Ernfors, "Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing," Nature Neuroscience, vol. 18 no. 1, pp. 145-153, DOI: 10.1038/nn.3881, 2015.
[34] Q. Deng, D. Ramsköld, B. Reinius, R. Sandberg, "Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells," Science, vol. 343 no. 6167, pp. 193-196, DOI: 10.1126/science.1245316, 2014.
[35] X. Fan, X. Zhang, X. Wu, H. Guo, Y. Hu, F. Tang, Y. Huang, "Single-cell RNA-seq transcriptome analysis of linear and circular RNAs in mouse preimplantation embryos," Genome Biology, vol. 16 no. 1, pp. 148-217, DOI: 10.1186/s13059-015-0706-1, 2015.
[36] M. Goolam, A. Scialdone, S. J. L. Graham, I. C. Macaulay, A. Jedrusik, A. Hupalowska, T. Voet, J. C. Marioni, M. Zernicka-Goetz, "Heterogeneity in Oct4 and Sox2 targets biases cell fate in 4-cell mouse embryos," Cell, vol. 165 no. 1, pp. 61-74, DOI: 10.1016/j.cell.2016.01.047, 2016.
[37] A. A. Kolodziejczyk, J. K. Kim, J. C. H. Tsang, T. Ilicic, J. Henriksson, K. N. Natarajan, A. C. Tuck, X. Gao, M. Bühler, P. Liu, J. C. Marioni, S. A. Teichmann, "Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation," Cell Stem Cell, vol. 17 no. 4, pp. 471-485, DOI: 10.1016/j.stem.2015.09.011, 2015.
[38] S. Nestorowa, F. K. Hamey, B. Pijuan Sala, E. Diamanti, M. Shepherd, E. Laurenti, N. K. Wilson, D. G. Kent, B. Göttgens, "A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation," Blood, vol. 128 no. 8, pp. e20-e31, DOI: 10.1182/blood-2016-05-716480, 2016.
[39] B. Treutlein, D. G. Brownfield, A. R. Wu, N. F. Neff, G. L. Mantalas, F. H. Espinoza, T. J. Desai, M. A. Krasnow, S. R. Quake, "Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq," Nature, vol. 509 no. 7500, pp. 371-375, DOI: 10.1038/nature13173, 2014.
[40] J. G. Camp, F. Badsha, M. Florio, S. Kanton, T. Gerber, M. Wilsch-Bräuninger, E. Lewitus, A. Sykes, W. Hevers, M. Lancaster, J. A. Knoblich, R. Lachmann, S. Pääbo, W. B. Huttner, B. Treutlein, "Human cerebral organoids recapitulate gene expression programs of fetal neocortex development," Proceedings of the National Academy of Sciences, vol. 112 no. 51, pp. 15672-15677, DOI: 10.1073/pnas.1520760112, 2015.
[41] J. G. Camp, K. Sekine, T. Gerber, H. Loeffler-Wirth, H. Binder, M. Gac, S. Kanton, J. Kageyama, G. Damm, D. Seehofer, L. Belicova, M. Bickle, R. Barsacchi, R. Okuda, E. Yoshizawa, M. Kimura, H. Ayabe, H. Taniguchi, T. Takebe, B. Treutlein, "Multilineage communication regulates human liver bud development from pluripotency," Nature, vol. 546 no. 7659, pp. 533-538, DOI: 10.1038/nature22796, 2017.
[42] L. Yan, M. Yang, H. Guo, L. Yang, J. Wu, R. Li, P. Liu, Y. Lian, X. Zheng, J. Yan, J. Huang, M. Li, X. Wu, L. Wen, K. Lao, R. Li, J. Qiao, F. Tang, "Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells," Nature Structural & Molecular Biology, vol. 20 no. 9, pp. 1131-1139, DOI: 10.1038/nsmb.2660, 2013.
[43] Y. J. Wang, J. Schug, K.-J. Won, C. Liu, A. Naji, D. Avrahami, M. L. Golson, K. H. Kaestner, "Single-cell transcriptomics of the human endocrine pancreas," Diabetes, vol. 65 no. 10, pp. 3028-3038, DOI: 10.2337/db16-0405, 2016.
[44] H. Li, E. T. Courtois, D. Sengupta, Y. Tan, K. H. Chen, J. J. L. Goh, S. L. Kong, C. Chua, L. K. Hon, W. S. Tan, M. Wong, P. J. Choi, L. J. K. Wee, A. M. Hillmer, I. B. Tan, P. Robson, S. Prabhakar, "Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors," Nature Genetics, vol. 49 no. 5, pp. 708-718, DOI: 10.1038/ng.3818, 2017.
[45] J. Li, "Single-cell transcriptomes reveal characteristic features of human pancreatic islet cell types," EMBO Reports, vol. 17 no. 2, pp. 178-187, 2016.
[46] A. P. Patel, I. Tirosh, J. J. Trombetta, A. K. Shalek, S. M. Gillespie, H. Wakimoto, D. P. Cahill, B. V. Nahed, W. T. Curry, R. L. Martuza, D. N. Louis, O. Rozenblatt-Rosen, M. L. Suvà, A. Regev, B. E. Bernstein, "Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma," Science, vol. 344 no. 6190, pp. 1396-1401, DOI: 10.1126/science.1254257, 2014.
[47] A. A. Pollen, "Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex," Nature Biotechnology, vol. 32 no. 10, pp. 1053-1058, 2014.
[48] G. La Manno, D. Gyllborg, S. Codeluppi, K. Nishimura, C. Salto, A. Zeisel, L. E. Borm, S. R. W. Stott, E. M. Toledo, J. C. Villaescusa, P. Lönnerberg, J. Ryge, R. A. Barker, E. Arenas, S. Linnarsson, "Molecular diversity of midbrain development in mouse, human, and stem cells," Cell, vol. 167 no. 2, pp. 566-580, DOI: 10.1016/j.cell.2016.09.027, 2016.
[49] B. Tasic, V. Menon, T. N. Nguyen, T. K. Kim, T. Jarsky, Z. Yao, B. Levi, L. T. Gray, S. A. Sorensen, T. Dolbeare, D. Bertagnolli, J. Goldy, N. Shapovalova, S. Parry, C. Lee, K. Smith, A. Bernard, L. Madisen, S. M. Sunkin, M. Hawrylycz, C. Koch, H. Zeng, "Adult mouse cortical cell taxonomy revealed by single cell transcriptomics," Nature Neuroscience, vol. 19 no. 2, pp. 335-346, DOI: 10.1038/nn.4216, 2016.
[50] A. Zeisel, "Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq," Science, vol. 347 no. 6226, pp. 1138-1142, 2015.
[51] D. Grün, M. J. Muraro, J. C. Boisset, K. Wiebrands, A. Lyubimova, G. Dharmadhikari, M. van den Born, J. van Es, E. Jansen, H. Clevers, E. J. P. de Koning, A. van Oudenaarden, "De novo prediction of stem cell identity using Single-Cell transcriptome data," Cell Stem Cell, vol. 19 no. 2, pp. 266-277, DOI: 10.1016/j.stem.2016.05.010, 2016.
[52] M. Baron, A. Veres, S. L. Wolock, A. L. Faust, R. Gaujoux, A. Vetere, J. H. Ryu, B. K. Wagner, S. S. Shen-Orr, A. M. Klein, D. A. Melton, I. Yanai, "A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure," Cell Systems, vol. 3 no. 4, pp. 346-360, DOI: 10.1016/j.cels.2016.08.011, 2016.
[53] M. J. Muraro, G. Dharmadhikari, D. Grün, N. Groen, T. Dielen, E. Jansen, L. van Gurp, M. A. Engelse, F. Carlotti, E. J. P. de Koning, A. van Oudenaarden, "A single-cell transcriptome atlas of the human pancreas," Cell Systems, vol. 3 no. 4, pp. 385-394, DOI: 10.1016/j.cels.2016.09.002, 2016.
[54] Y. Xin, J. Kim, H. Okamoto, M. Ni, Y. Wei, C. Adler, A. J. Murphy, G. D. Yancopoulos, C. Lin, J. Gromada, "RNA sequencing of single human islet cells reveals type 2 diabetes genes," Cell Metabolism, vol. 24 no. 4, pp. 608-615, DOI: 10.1016/j.cmet.2016.08.018, 2016.
[55] A. Lipowski, D. Lipowska, "Roulette-wheel selection via stochastic acceptance," Physica A: Statistical Mechanics and Its Applications, vol. 391 no. 6, pp. 2193-2196, 2012.
[56] V. Ho-Huu, T. Nguyen-Thoi, T. Truong-Khac, L. Le-Anh, T. Vo-Duy, "An improved differential evolution based on roulette wheel selection for shape and size optimization of truss structures with frequency constraints," Neural Computing and Applications, vol. 29 no. 1, pp. 167-185, DOI: 10.1007/s00521-016-2426-1, 2018.
[57] W. Chen, G. Feng, "Spectral clustering: a semi-supervised approach," Neurocomputing, vol. 77 no. 1, pp. 229-242, DOI: 10.1016/j.neucom.2011.09.002, 2012.
[58] B. Wang, J. Zhu, E. Pierson, D. Ramazzotti, S. Batzoglou, "Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning," Nature Methods, vol. 14 no. 4, pp. 414-416, DOI: 10.1038/nmeth.4207, 2017.
Copyright © 2022 Junhong Feng et al. This work is licensed under http://creativecommons.org/licenses/by/4.0/.
Abstract
Single-cell RNA sequencing (scRNA-seq) is emerging as a promising technology. A scRNA-seq dataset contains a huge number of genes. Some of these are high quality genes, while others are noisy or irrelevant owing to technical artifacts. Such noisy and irrelevant genes can strongly affect downstream analyses, such as cell classification, gene function analysis, and cancer biomarker detection. It is therefore important to remove irrelevant genes and choose high quality genes with gene selection methods. In this study, a novel gene selection and classification method is presented that combines the information gain ratio and the genetic algorithm with dynamic crossover (abbreviated as IGRDCGA). The information gain ratio (IGR) is employed to eliminate irrelevant genes roughly and obtain a preliminary gene subset, and the genetic algorithm with dynamic crossover (DCGA) is then utilized to choose high quality genes finely from that subset. The main difference between the IGRDCGA and existing methods is that it is the first to integrate the DCGA and the IGR for selecting genes from scRNA-seq data. We evaluated the IGRDCGA and several competing methods on real-world scRNA-seq datasets. The results demonstrate that the IGRDCGA chooses high quality genes effectively and efficiently and outperforms the competing methods in terms of both dimensionality reduction and classification accuracy.