Introduction
Recent advances in single-cell RNA-seq (scRNA-seq) technologies have enabled the simultaneous measurement of expression levels of thousands of genes across hundreds to thousands of individual cells 1– 8. This opens up new possibilities for deconvolution of expression patterns seen in bulk samples, detection of previously unknown cell populations and deeper characterization of known ones. However, computational analyses are complicated by the high variability, low capture efficiency and high dropout rates of scRNA-seq assays 9– 11, as well as by strong batch effects that are often confounded by the experimental factor of interest 12.
Given a collection of single cells, a common analysis task involves identification and characterization of subpopulations, e.g., cell types or cell states. With lower-dimensional single-cell assays such as flow cytometry, cell type detection is often done manually, by visual inspection of a series of two-dimensional scatter plots of marker pairs (“gating”) and subsequent identification of clusters of cells with specific abundance patterns. With large numbers of markers, such strategies quickly become unfeasible, and they are also likely to miss previously uncharacterized cell populations. Instead, subpopulation detection in higher-dimensional single-cell experiments such as mass cytometry (CyTOF) and scRNA-seq is often done automatically, via some form of clustering. As a consequence, a large number of clustering approaches specifically designed for or adapted to these types of assays are available in the literature 13.
While extensive evaluations of clustering methods have been performed for flow and mass cytometry data
14,
15, there are to date fewer such studies available for scRNA-seq. The latter is complicated by the large number of different data generation protocols available for scRNA-seq, which in turn has a big effect on the data characteristics. Menon
16 specifically evaluated three methods (
Here, we extend these initial studies to a broader range of data sets with different characteristics and additionally consider simulated data with different degrees of cluster separability. We evaluate 14 clustering algorithms, including both methods specifically developed for scRNA-seq data, methods developed for other types of single-cell data, and more general approaches, on a total of 12 different data sets. In order to focus on the performance of the clustering algorithms themselves, we use the same preprocessing approach (specifically cell and gene filtering) for all methods, and investigate the impact of the preprocessing separately. In addition to investigating how well the clustering methods are able to recover the true partition if the number of subpopulations is known, we evaluate whether they are able to correctly determine the number of clusters. Further, we study the stability and run time of the methods and investigate whether performance can be improved by generating a consensus partition based on results from multiple individual clustering methods, and the impact of the choice of methods to include in such an aggregation.
We observed large differences in the clustering results as well as in the run times of the different methods.
Given the high level of activity in methods research for preprocessing, clustering and visualization of scRNA-seq data, it is expected that many new algorithms (or new flavors of existing ones) will be proposed. In order to facilitate re-assessment as new innovations emerge and to provide extensibility to new methods and data sets, all (filtered and unfiltered) data sets as well as all clustering results are accessible via an R/Bioconductor package, leveraging the Bioconductor ExperimentHub framework ( https://bioconductor.org/packages/DuoClustering2018). In addition, the complete code used to run all analyses is available on https://github.com/markrobinsonuzh/scRNAseq_clustering_comparison. The current system uses a Makefile to run a set of R scripts for clustering, summarization and visualization of the results.
Methods
Real data sets
Three real scRNA-seq data sets were downloaded from
conquer
22 and used for our evaluations: GSE60749-GPL13112 (here denoted
Kumar
23), SRP073808 (
Koh
24) and GSE52529-GPL16791 (
Trapnell
25). These data sets were chosen to represent different degrees of “difficulty” in the clustering task. In particular, the
Trapnell data set was not generated with the aim of detecting subpopulations, but rather to investigate a continuous developmental trajectory. Nevertheless, it was included in our evaluation as an example of a data set where the phenotype designated as the “true” cluster labels (see below) may not represent the strongest signal present in the data.
Table 1 and
Supplementary Figure 1 give an overview of all data sets used in this study. For each of the data sets from
conquer, the gene-level length-scaled TPM values (below referred to as “counts” since they are on the same scale as the raw read counts) and the phenotype were extracted from the MultiAssayExperiment
26 object provided by
conquer and used to create a SingleCellExperiment object. We also estimated transcript compatibility counts (TCCs) for each of these data sets using
Table 1.
Overview of the data sets used in the study.
Data set | Sequencing
| #
| #
| Median total
| Median #
| #
| Description | Ref. |
---|---|---|---|---|---|---|---|---|
Koh | SMARTer | 531 | 48,981 | 1,390,268 | 14,277 | 9 | FACS purified H7 human
| 24 |
KohTCC | SMARTer | 531 | 811,938 | 1,391,012 | 66,086 | 9 | FACS purified H7 human
| 24 |
Kumar | SMARTer | 246 | 45,159 | 1,687,810 | 26,146 | 3 | Mouse embryonic stem
| 23 |
KumarTCC | SMARTer | 263 | 803,405 | 717,438 | 63,566 | 3 | Mouse embryonic stem
| 23 |
SimKumar4easy | - | 500 | 43,606 | 1,769,155 | 29,979 | 4 | Simulation using different
| 29 |
SimKumar4hard | - | 499 | 43,638 | 1,766,843 | 30,094 | 4 | Simulation using different
| 29 |
SimKumar8hard | - | 499 | 43,601 | 1,769,174 | 30,068 | 8 | Simulation using different
| 29 |
Trapnell | SMARTer | 222 | 41,111 | 1,925,259 | 13,809 | 3 | Human skeletal muscle
| 25 |
TrapnellTCC | SMARTer | 227 | 684,953 | 1,819,294 | 66,864 | 3 | Human skeletal muscle
| 25 |
Zhengmix4eq | 10xGenomics
| 3,994 | 15,568 | 1,215 | 487 | 4 | Mixtures of FACS
| 5 |
Zhengmix4uneq | 10xGenomics
| 6,498 | 16,443 | 1,145 | 485 | 4 | Mixtures of FACS
| 5 |
Zhengmix8eq | 10xGenomics
| 3,994 | 15,716 | 1,298 | 523 | 8 | Mixtures of FACS
| 5 |
The selected cell phenotype was used to define the "true" partition of cells when evaluating the clustering methods. For the Kumar data set, we grouped the cells by the genetic perturbation and the medium in which they were grown. For the Trapnell data set we used the time point (after the switch of growth medium) at which the cells were captured, and for the Koh data set we used the cell type annotated by the data collectors (obtained through FACS sorting). We note that the definition of the ground truth constitutes an intrinsic difficulty in the evaluation of clustering methods, since it is plausible that there are several different, but still biologically interpretable, ways of partitioning cells in a given data set, several of which can represent equally strong signals. Many public droplet-based data sets contain cell type labels, but these are typically inferred by clustering the cells using the scRNA-seq data, and thus any evaluation based on these labels risks being biased in favor of methods similar to the one used to derive the labels in the first place. By using ground truths that are defined independently of the scRNA-seq assay, we thus avoid artificial inflation of the signal that could result if the truth was derived from the scRNA-seq data itself.
In addition to the data sets from conquer, we obtained UMI counts from the Zheng data set 5, generated by the 10x Genomics GemCode protocol, from https://support.10xgenomics.com/single-cell-gene-expression/datasets. We downloaded counts for eight pre-sorted cell types (B-cells, naive cytotoxic T-cells, CD14 monocytes, regulatory T-cells, CD56 NK cells, memory T-cells, CD4 T-helper cells and naive T-cells) and combined them into three data sets, with a mix of well-separated (e.g., B-cells vs T-cells) and similar cell types (e.g., different types of T-cells) and uniform and non-uniform cluster sizes. For the data set denoted Zhengmix4eq, we combined randomly selected B-cells, CD14 monocytes, naive cytotoxic T-cells and regulatory T-cells in equal proportions (1,000 cells per subpopulation). For the Zhengmix4uneq data set, we combined the same four cell types, but in unequal proportions (1,000 B-cells, 500 naive cytotoxic T-cells, 2,000 CD14 monocytes and 3,000 regulatory T-cells). For the Zhengmix8eq data set, we combined cells from all eight populations, in approximately equal proportions (400–600 cells per population). For these data sets, we used the annotated cell type (obtained by pre-sorting of the cells) as the true cell label.
Simulated data sets
Using one subpopulation of the
Kumar data set as input, we simulated scRNA-seq data with known group structure, using the
Data processing
The
Gene filtering
We evaluated three methods for reducing the number of genes provided as input to the clustering methods. For each filtering method, we retained 10% of the original number of genes (with a non-zero count in at least one cell) in the respective data sets. First, we retained only the genes with the highest average expression (log-normalized count) value across all cells (denoted Expr below). Second, we used
Clustering methods
Fourteen clustering methods, publicly available as R packages or scripts, were evaluated in this study (see
Table 2 for an overview). We included general-purpose clustering methods, such as hierarchical clustering and K-means, as well as methods developed specifically for scRNA-seq data, such as
Table 2.
Clustering methods.
Method | Description | Reference |
---|---|---|
| PCA dimension reduction (dim=30) and iterative hierarchical clustering | 36 |
| PCA dimension reduction based on zero-imputed similarities, followed by hierarchical clustering | 37 |
| PCA dimension reduction (dim=30) followed by self-organizing maps (5x5, 8x8 or 15x15 grid,
| 38 |
| t-SNE dimension reduction (initial PCA dim=50, t-SNE dim=3) followed by density-based clustering | 25, 39 |
| PCA dimension reduction (dim=30) and hierarchical clustering with Ward.D2 linkage | 33, 40 |
| PCA dimension reduction (dim=30) and K-means clustering with 25 random starts | 33, 41 |
| PCA dimension reduction (dim=30) and k-means clustering through an iterative process.
| 42 |
| K-medoids clustering based on Pearson correlation dissimilarities | 43 |
| t-SNE dimension reduction (initial PCA dim=50, t-SNE dim=3, perplexity=30) and K-means
| 34, 41, 44 |
| Ensemble clustering using SC3, CIDR, Seurat and t-SNE + Kmeans | 45 |
| PCA dimension reduction or Laplacian graph. K-means clustering on different dimensions.
| 46 |
| Using SC3 to derive the clusters for half of the cells, then using a support vector machine (SVM)
| 46, 47 |
| Dimension reduction by PCA (dim=30) followed by nearest neighbor graph clustering | 17 |
| PCA dimension reduction followed by model-based clustering | 48 |
All methods except
Evaluation criteria
In order to evaluate how well the inferred clusters recovered the true subpopulations, we used the Hubert-Arabie Adjusted Rand Index (ARI) for comparing two partitions
49. The metric is adjusted for chance, such that independent clusterings have an expected index of zero and identical partitions have an ARI equal to 1, and was calculated using the implementation in the
We used a normalized Shannon entropy 50 to evaluate whether the methods preferentially partitioned the cells into clusters of equal size, or whether they preferred to generate some large and some small clusters. Given proportions p 1, . . . , p N of cells assigned to each of N clusters, the normalized Shannon entropy is defined by
Since the true degree of equality of the cluster sizes varies between data sets, we subtracted the normalized entropy calculated from the true partition to obtain the final performance index.
To evaluate the similarities between the partitions obtained by different methods, we first calculated a consensus partition from the five independent runs for each method, using the
Finally, we investigated whether clustering performance was improved by combining two methods into an ensemble. For each data set, and with the true number of clusters imposed, we calculated a consensus partition for each pair of methods using the
Results
Large differences in performance across data sets and methods
The 14 methods were tested on real data sets as well as simulations with a varying degree of complexity ( Table 1) and across a range of the number of subpopulations. Focusing on the agreement between the true partitions and the clusterings obtained by imposing the true number of clusters showed a large difference between data sets as well as between methods ( Figure 1; a summary across different numbers of clusters can be found in Supplementary Figure 3).
Figure 1.
Median ARI scores, representing the agreement between the true partition and the one obtained by each method, when the number of clusters is fixed to the true number.
Each row corresponds to a different data set, each panel to a different gene filtering method, and each column to a different clustering method. The methods and the data sets are ordered by their mean ARI across the filterings and data sets. Some methods failed to return a clustering with the correct number of clusters for certain data sets (indicated by white squares).
As expected, excellent performances were achieved for the well-separated data sets with a strong difference between the groups of cells (
Kumar,
KumarTCC and
SimKumar4easy). When filtering by expression or variability, close to all methods achieved a correct partitioning of the cells in these data sets, while the
We note that the
While none of the methods consistently outperformed the others over the full range of the imposed numbers of clusters in all data sets,
We also investigated whether the number of detected features per cell differed between the clusters, using a Kruskal-Wallis test 52. No strong association was found for the simulated data sets ( Supplementary Figure 6), indicating that there is low inherent bias in the clustering algorithms. For most of the real data sets, we found highly significant differences in the number of detected features between cells in different clusters. However, it is unclear whether this represents a technical effect or a biological difference between the cell populations.
Run times vary widely between methods
We measured the elapsed time for each run, using a single core and excluding the time to estimate the number of clusters if this was done via a separate function. Since the run times are strongly dependent on the number of features and cells in a data set, we represent them as normalized run times, by dividing with the time required for
Figure 2.
(
A) Normalized run times, using
Plotting the run time versus the ARI for a subset of the data sets (excluding the ones with the strongest signal, where all methods found the correct clusters, and the TCC data sets) (
Figure 2B) further illustrated the variability between the methods. Interestingly,
The scalability of the methods was investigated by subsampling the largest data set (
Zhengmix4uneq) and plotting the run time as a function of the number of cells (
Supplementary Figure 9). The majority of the methods showed a linear increase in run time as a function of the number of cells, while
High stability between clustering runs
Figure 1 illustrated the average performance of each method across the five runs on each data set, for the true number of clusters. By comparing the partitions obtained in the individual runs, we could also obtain a measure of the stability of each method ( Figure 3A).
Figure 3.
( A) Median stability (ARI across different runs on the same data set) for the methods, with the annotated number of clusters imposed. Some methods failed to return a clustering with the correct number of clusters for certain data sets (indicated by white squares). ( B) The difference between the normalized entropy of the obtained clusterings and that of the true partitions, across all data sets and for the annotated number of clusters. ( C) The difference between the number of clusters giving the maximal ARI and the annotated number of clusters, across all data sets.
Figure 4.
Clustering of the methods based on the average similarity of their partitions across data sets, for the true number of clusters.
Numbers on internal nodes indicate the fraction of dendrograms from individual data sets where a particular subcluster was found.
A summary of the variability both within and between the different filterings is shown in
Supplementary Figure 10. It is worth noting that comparing the performances between the different filtering approaches is difficult for two reasons: first, the variability of the clustering runs for a given filtering might exceed the variation between the filterings, and second, filtering with
Qualitative differences between cluster characteristics
By computing the Shannon entropy for the various partitions, we obtained a measure of the equality of the sizes of the clusters (
Figure 3B). Since the true degree of cluster size uniformity as well as the number of clusters are different between data sets, we compared the normalized Shannon entropy of the clusterings to that of the true partitions. Thus, a positive value of this statistic indicates that a method tends to produce more equally sized clusters than the true ones, and a negative value instead indicates that the method tends to return more unequal cluster sizes, e.g., one large cluster and a few small ones. Most methods gave cluster sizes that were compatible with the true sizes for most data sets (a statistic close to 0), while especially
The optimal number of clusters can differ from the ”true” one
Above, we investigated the performance and stability of the methods when the true number of clusters (the number of different labels in the partitioning considered as the ground truth) was imposed. Whether this number of clusters actually provided the highest ARI value (i.e., the best agreement with the ground truth) mainly depended on the difficulty of the clustering task (
Figure 3C), and the choice of method. No method achieved the best performance at the annotated number of clusters in all the data sets, although generally, the methods reached their maximum performance at or near the annotated number of clusters. The notable exception was
Inconsistent degree of similarity between methods
The similarity between each pair of methods was quantified by means of the ARIs for each pair of consensus clusterings (across the five runs of each method for each data set and number of clusters).
Figure 4 shows a dendrogram of the methods obtained by hierarchical clustering based on the average ARI values across all data sets for the true number of clusters. The numbers shown at the internal nodes indicate the stability of the subclusters, that is, the fraction of the corresponding dendrograms from the individual data sets where a particular subcluster could be found. In general, the groupings of the methods shown in the dendrogram were unstable across data sets, indicated by the low stability fractions of all subclusters. This is consistent with previous studies showing generally poor concordance that varied across data sets
20,
45. Even
Ensembles often don’t improve clustering performance
Next, we investigated whether we could improve the clustering performance by combining methods into an ensemble. For each pair of methods, we generated a consensus clustering and evaluated its agreement with the true partition using the ARI. In general, the performance of the ensemble was worse than the better of the two combined methods, and better than the worse of the two methods (
Figure 5A), suggesting that we would obtain a better performance by choosing a single good clustering method rather than combining multiple different ones. This is largely consistent with a recent study evaluating the combination of four methods (
Figure 5.
Comparison between individual methods and ensembles.
( A) Difference between the ARI of each ensemble and the ARI of the best (left) and worst (right) of the two methods in the ensemble, across all data sets and for the true number of clusters. ( B) Difference between the ARI of each ensemble and each of the components, across all data sets and for the true number of clusters. The histogram in row i, column j represents the differences between the ARIs of the ensemble of the methods in row i and column j and the ARI of the method in row i on its own.
Discussion and conclusions
In this study, we have evaluated 14 clustering methods on both real and simulated scRNA-seq data. There were large differences in the ability of the methods to recover the annotated clusters, and performance was also strongly dependent on the degree of separation between the true classes.
The same preprocessing steps and fixed gene sets were used for all clustering methods. This enabled us to investigate the impact of the clustering algorithm itself, rather than entire pipelines or workflows. The selection of the filtering approach had an impact on the majority of the methods and resulted in different clustering solutions. Specifically for the more difficult data sets there was a higher dissimilarity. However, this did not necessarily affect the performances of the methods.
The stability of clustering algorithms can be evaluated by generating perturbed subsamples of the data set and redoing the clusterings. These subsamples can be created in several ways, e.g., by random subsampling with or without replacement, by adding noise to the original data
53 or by simulating technical replicates
54. Freytag
20 showed that
The evaluated methods are based on a broad spectrum of approaches for dimensionality reduction and clustering. We note that the majority of the methods use PCA or PCoA for dimension reduction or Euclidean distances as the distance metric (
We investigated the impact of changing the imposed number of clusters for the different methods, which revealed that a subset of the methods, in particular
Data availability
The R/Bioconductor data package
Archived R scripts as at time of publication are available from https://doi.org/10.5281/zenodo.1314743
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Copyright: © 2018 Duò A et al. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Subpopulation identification, usually via some form of unsupervised clustering, is a fundamental step in the analysis of many single-cell RNA-seq data sets. This has motivated the development and application of a broad range of clustering methods, based on various underlying algorithms. Here, we provide a systematic and extensible performance evaluation of 14 clustering algorithms implemented in R, including both methods developed explicitly for scRNA-seq data and more general-purpose methods. The methods were evaluated using nine publicly available scRNA-seq data sets as well as three simulations with varying degree of cluster separability. The same feature selection approaches were used for all methods, allowing us to focus on the investigation of the performance of the clustering algorithms themselves.
We evaluated the ability of recovering known subpopulations, the stability and the run time and scalability of the methods. Additionally, we investigated whether the performance could be improved by generating consensus partitions from multiple individual clustering methods. We found substantial differences in the performance, run time and stability between the methods, with SC3 and Seurat showing the most favorable results. Additionally, we found that consensus clustering typically did not improve the performance compared to the best of the combined methods, but that several of the top-performing methods already perform some type of consensus clustering.
All the code used for the evaluation is available on GitHub ( https://github.com/markrobinsonuzh/scRNAseq_clustering_comparison). In addition, an R package providing access to data and clustering results, thereby facilitating inclusion of new methods and data sets, is available from Bioconductor ( https://bioconductor.org/packages/DuoClustering2018).
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer