Full Text

Turn on search term navigation

Introduction

Gene expression profiling provides a powerful approach for generating hypotheses about the underlying biological mechanisms that give rise to changes in gene expression under different conditions. This is a critical step in advancing our understanding of complex biological systems and diseases. With over a decade of development, bulk RNA sequencing (bulk RNA-seq) has facilitated the creation of extensive databases, such as the Genotype Tissue Expression Project (GTEx)^[¹^] and the Cancer Gene Expression Atlas (TCGA),^[^1b^] offering a valuable resource for investigating gene expression in diverse tissues and diseases. These large-scale datasets enable researchers to explore the complex regulatory networks and molecular pathways underlying various biological processes, providing a foundation for the development of novel therapeutic interventions.

However, bulk RNA-seq is commonly limited by cellular heterogeneity, which is a significant challenge across samples. The mixed cell population within each sample results in an overall signature that represents only the average state of the population, thereby limiting the ability to capture the full spectrum of cellular heterogeneity. Single-cell RNA sequencing (scRNA-seq) provides a means of obtaining transcriptomes that are specific to individual cells;^[²^] However, it is often more expensive and technically challenging than mature bulk RNA-seq,^[³^] which are still widely used due to their cost-effectiveness and scalability. As an alternative, we propose algorithms for computational deconvolution, which is the process of separating a heterogeneous mixture of signals into its constituent cell type-specific signals derived from the data.^[⁴^] This approach provides a cost-effective method to obtain cell composition information and significantly increases the speed and scale of related applications, which can be employed in various biological and clinical research contexts. Its value extends to the analysis of specific drug treatment effects or changes in conditions on cell types, providing a highly cost-effective solution.

Many computational deconvolution methods have been devised to identify the composition of cell types or their specific states within bulk samples.^[⁵^] These methods can be classified into unsupervised and supervised categories depending on whether reference information is available, such as pure cell type expression profiles or lists of marker genes. Fully unsupervised methods that utilize non-negative matrix factorization (NMF)^[^4b^] have low deconvolution accuracy and require meaningful gene signatures or expression profiles of distinct cell types for interpretation. Thus, the most frequently employed methods belong to the supervised category, utilizing marker gene-based approaches to quantify each cell type independently in heterogeneous samples by using expression values of specific marker genes. MCPcounter^[⁶^] aggregates them into abundance scores, and xCell^[⁷^] performs statistical tests for marker gene enrichment. Other methods, such as DSA,^[⁸^] MMAD,^[⁹^] and CAMmarker,^[¹⁰^] use marker gene lists to guide deconvolution analysis. UNDO^[¹¹^] is an algorithm that automatically detects cell-specific marker genes within the mixed gene expression scatter radius. It then estimates the proportion of cell types in each sample and deconvolutes the mixed expression by generating a cell-specific expression profile. TIMER^[¹²^] is a web-based tool that utilizes a novel statistical method to estimate the relative abundance of six different immune cell types present in the tumor microenvironment. Alternatively, single-cell reference-based methods express the deconvolution problem as a set of equations that represent the gene expression of a sample as a weighted sum of mixed cell type expression profiles,^[¹³^] this approach enables the inference of bulk cell type proportions within a sample using scRNA-seq. Scaden^[¹⁴^] is a deconvolution approach that employs a neural network and it trained on large-scale scRNA-Seq datasets to improve robustness. Bisque^[¹⁵^] learns bulk gene-specific transformations from scRNA-seq, this approach eliminates the technical bias that may arise due to differences in sequencing techniques between single-cell and bulk RNA-seq. MuSiC^[¹⁶^] proposed a weighted non-negative least squares regression framework. This framework simultaneously weights each gene by cross-sample and cross-cell variation. SCDC^[¹⁷^] extends the MuSiC method by proposing an ensemble framework that utilizes multiple scRNA-seq datasets as reference and can implicitly address batching effects between reference datasets in different experiments.

In summary, computational deconvolution provides a valuable method to obtain cell composition information and analyze the specific effects of drug treatments or changes in conditions on cell types. However, existing methods have limitations, including their reliance on specific datasets, the need for a user-supplied matrix of cell-type characteristics, and challenges posed by differences in data distribution between single-cell and bulk RNA-seq data. To overcome these limitations, we propose a novel approach that resolves differences in data distribution between single-cell and bulk RNA-seq data. It constructs cell type-specific expression profiles based on scRNA-seq, and importantly, does not require a user-supplied matrix. Our model accurately identifies bulk tissue cell types, addressing the limitations of existing methods and providing a valuable tool for analyzing the effects of drug treatments or changes in conditions on cell types.

Experimental Section Datasets and Preprocessing

The performance of computational deconvolution using two ways was evaluated. First, “pseudobulk” experiments, which was a popular approach that involved aggregating scRNA-seq measurements from cells to create gene expression mixtures proportional to the known cell types were performed. By utilizing pseudobulk experiments, this study was able to simulate bulk RNA-seq data and evaluate the accuracy of the deconvolution algorithm for diverse cell types. Second, bulk RNA-seq with known cell types proportions to further evaluate the performance of the deconvolution algorithm was utilized. This approach allowed us to compare the results with the known cell type proportions, providing an additional measure of the accuracy of the method. The method using these two approaches was benchmarked.

Pseudobulk Data

In the study, data from The Tabula Muris Consortium^[¹⁸^] which includes data on numerous organs/tissues and many cell types through two distinct single-cell experimental protocols: Smart-seq2^[¹⁹^] and 10x Genomics Chromium was utilized. Through simulations across a diverse array of scenarios, it endeavors to comprehensively account for various real-world situations. For the in silico experiments, This study specifically selected on the following organs/tissues: Kidney, Large_intestine, Liver, Lung, Pancreas, Skin, Thymus, and Trachea (as shown in Table 1) for the in silico experiments. Two different deconvolutions for each tissue were performed. To perform cross-protocol deconvolution, in one configuration, 10x Genomics Chromium as single-cell reference and Smart-seq2 as the pseudobulk, and vice versa for the other configuration were used.

Table 1 Cell types for every organ in the pseduobulk experiments.

Organs	# cell types	Cell types
Kidney	6	T cell, epithelial cell of proximal tubule, kidney collecting duct principal cell, macrophage, fenestrated cell, B cell
Large intestine	4	Intestinal crypt stem cell, enterocyte of epithelium of large intestine, epithelial cell of large intestine, large intestine goblet cell
Liver	6	Hepatocyte, Kupffer cell, myeloid leukocyte, NK cell, endothelial cell of hepatic sinusoid, B cell
Lung	9	B cell, monocyte, macrophage, myeloid cell, endothelial cell, epithelial cell, lymphocyte, stromal cell, granulocyte
Pancreas	9	Pancreatic ductal cell, pancreatic A cell, pancreatic B cell, pancreatic acinar cell, leukocyte, pancreatic D cell, endothelial cell, pancreatic PP cell, pancreatic stellate cell
Skin	3	Keratinocyte stem cell, epidermal cell, basal cell of epidermis
Thymus	3	Professional antigen presenting cell, thymocyte, DN4 thymocyte
Trachea	7	Fibroblast, endothelial cell, basal epithelial cell of tracheobronchial tree, granulocyte, macrophage, chondrocyte, smooth muscle cell of trachea

The order in which the cell types are listed here corresponds to the order in any figures that were based on the pseudobulk experiments using the Tabula Muris Senis data.

Real Bulk RNA-Seq Data

In this section, a detailed description of the real datasets that were used to compare different methods was offered. The first dataset was obtained from Dong^[¹⁷^] and consists of human breast cancer and fibroblast cell lines and mixtures. In the second dataset, bulk whole blood samples from Newman were obtained.^[²⁰^] However, neutrophils were not present in the single-cell references matched to the bulk RNA-seq, so neutrophils from Xie^[²¹^] instead was used. The third dataset was obtained from Monaco^[²²^] and consists of bulk whole blood, with single-cell references from 10x Genomics and Xie^[²¹^] used for comparison. In the processing of these public datasets, whether for scRNA-seq or bulk RNA-seq, this study has applied the default normalization methods. The final dataset used for method comparison was generated from our own biological experiment on mouse spinal cord tissue, with both bulk and single-cell RNA sequencing data obtained. The details of the methods used to generate and analyze the data were provided in the subsequent sections.

scRNA-Seq Preprocessing and Analysis Spinal Cord-Derived NSCs Culture and Single Cell Solution Preparation

Animal experiments in this study were conducted in strict accordance with the 1996 revised National Institutes of Health Guide for the Care and Use of Laboratory Animals (NIH Publication No. 80-23) and were approved by the Medical Ethics Committee of Xijing Hospital of the Fourth Military Medical University. Every effort to minimize animal suffering and reduce the number of animals used was made.

To isolate neural stem cells (NSCs), a previously published method^[²³^] and extracted cells from the spinal cords of 6-week-old adult C57/BL6 mice were followed. In brief, three female adult C57/BL6 mice for the experiments were used. The mice were anesthetized with 5% isoflurane and underwent laminectomy at the T6-12 vertebral level. The T6-T12 spinal cords were isolated from the mice and placed in 10 cm Petri dishes containing cold PBS. The tissue was then dissociated and filtered through 70 µm filters (BD Falcon) to obtain a single-cell suspension. The isolated cells were seeded in NSC growth medium consisting of Dulbecco's Modified Eagle Medium (DMEM/F12), B-27, and N-2 supplements (Gibco) at a density of 1 × 106 cells per mL.

To prepare single-cell solutions, the NSCs to enzymatic digestion with Accutase (Gibco) for two passages of 10 min at 37 °C were subjected. The enzymatic digestion was then stopped with NSC growth medium, and the resulting cell suspension was filtered through a 70 µm filter (BD Falcon). The single-cell solutions were stored on ice and loaded into BD Rhapsody cassettes to capture the single-cell transcriptomes.

Single Cell Transcriptome Capturation, Library Construction and Sequencing

In order to preserve cell precision and viability, this study used a staining protocol involving two fluorescent materials, Calcein AM and Draq7, and analyzed the cells using a BD Rhapsody Scanner (BD Biosciences). The cells were then loaded into a BD Rhapsody microwell cassette according to the protocol developed by Fan.^[²⁴^] To ensure that nearly every microwell contained a single cell capture bead, an excess of beads and washed off any excess beads from the column were loaded. Once the cells have been lysed with the lysis buffer, the cell capture beads need to be recovered and washed before reverse transcription can be performed. To generate bead-captured single-cell transcriptomes containing cell label and unique molecular identifier (UMI) information, this study used the BD Rhapsody cDNA Kit (BD Biosciences, Cat. No. 633773) and the BD Rhapsody Targeted mRNA and AbSeq Amplification Kit (BD Biosciences, Cat. No. 633801) following the manufacturer's protocol. All libraries were sequenced in paired-end 150 bp mode using the NovaSeq platform (Illumina).

Sequencing Data Processing

The raw sequencing reads from a cDNA library using the BD Rhapsody Whole Transcriptome Assay Analysis Pipeline (v1.8) were processed. The pipeline includes several steps, such as quality filtering, annotation, identification of cells, and generation of single-cell expression matrices. The output file contains a matrix of UMI counts per cell and gene, which was used for downstream analysis.

Dimensionality Reduction, Clustering and Visualization

First, the Seurat R package^[²⁵^] to read the original counts matrix of each sample and create a Seurat object was used. This study then filtered out poor quality cells based on the following parameters: percentage of mitochondria below 20, number of UMIs per cell greater than 500, and number of detected genes greater than 200. Normalization and scaling, followed by PCA dimensionality reduction based on the top 2000 highly variable genes were performed. Clustering was performed using the Seurat package's built-in clustering method. To visualize the dimensionality reduction and clustering results, t-SNE was used.

RNA-Seq Preprocessing and Analysis

Regarding the RNA-seq datasets analyzed in this study, no additional processing steps were applied. This study directly utilized the count or expression tables obtained from all downloaded datasets and subsequently subjected them to normalization.

To benchmark SCROAM against other available methods, including Bisque,^[¹⁵^] MuSiC,^[¹⁶^] SCDC,^[¹⁷^] and Non-Negative Least Squares (NNLS),^[²⁶^] all algorithms in each deconvolution with the same data were used. This study ran all algorithms using their respective tutorials with default settings, unless otherwise specified. The NNLS results were obtained from MuSiC's implementation.

Methods

The SCROAM model was developed to estimate cell type proportion in bulk tissues and apply it to study changes in cell type proportions in diseased tissues, providing valuable information for subsequent treatment of the disease. As depicted in Figure 1, SCROAM begins with a single-cell reference and assumes that each cell has been categorized to a fixed set of cell types. SCROAM can then deconvolve bulk samples from the same tissue and estimate the proportion of different cell type present in the sample. When scRNA-seq data was utilized as a reference for cell type deconvolution, the difference in sequencing distribution from bulk RNA-seq must be taken into account. When constructing a feature matrix from scRNA-seq, SCROAM was not based on the average expression, but instead, each gene was weighted according to its cell-specific score, allowing for larger gene sets to be used in deconvolution. Further details regarding this approach can be found in the Methods section below.

View Image - Figure 1. Overview of SCROAM. a) The deconvolution model that uses a reference requires two input datasets: bulk RNA-seq count and a reference containing counts of scRNA-seq reads. Additionally, the single-cell transcriptome data must label the cell type to be quantified. b) SCROAM learns gene-specific transformations of bulk data by utilizing the reference sequences observed in single-cell data. This allows us to account for potential technical bias between sequencing technologies used in single-cell and bulk RNA-seq data. c) SCROAM begins with scRNA-seq data and classifies the cells into different cell types, which were represented by different colors in the analysis. By calculating gene specificity in a given cell type, an expression matrix reflecting cell type specificity was constructed. d) SCROAM employs single-cell reference data to estimate the cell type ratio in transformed bulk data.

Figure 1. Overview of SCROAM. a) The deconvolution model that uses a reference requires two input datasets: bulk RNA-seq count and a reference containing counts of scRNA-seq reads. Additionally, the single-cell transcriptome data must label the cell type to be quantified. b) SCROAM learns gene-specific transformations of bulk data by utilizing the reference sequences observed in single-cell data. This allows us to account for potential technical bias between sequencing technologies used in single-cell and bulk RNA-seq data. c) SCROAM begins with scRNA-seq data and classifies the cells into different cell types, which were represented by different colors in the analysis. By calculating gene specificity in a given cell type, an expression matrix reflecting cell type specificity was constructed. d) SCROAM employs single-cell reference data to estimate the cell type ratio in transformed bulk data.

Bulk RNA-Seq Data Transformation

Inspired by domain adaptation learning, deep learning techniques can be utilized to address the problem of sequencing data differences by treating two types of sequencing data as the training set and test set of samples. Therefore, in this context, the Kernel Mean Matching^[²⁷^] (KMM) strategy, which exhibits robustness against noise and outliers and makes no assumptions about sample data, yielding superior performance in mitigating discrepancies among different samples was employed. The core idea was as follows: for a given sequencing dataset, this study obtains its corresponding feature space and calculate the dissimilarity between two feature spaces as weights for the samples. These weights were then used to iteratively adjust the classifier to eliminate any disparities in the data distribution.

To begin, it first intersect the genes between scRNA-seq and bulk RNA-seq data to obtain the public gene G. The dimensionality of each cell or sample was then defined as the number of common genes, which was denoted as d. In d-dimensional space, it obtain a set x_bulk = {x_i,i = 1, …, n_bulk} consisting of n_bulk independent and identically distributed (i.i.d) samples expressions with a probability density function p_bulk(x), as well as another set x_sc = {x_j,j = 1, …, n_sc} consisting of n_sc independent and identically distributed (i.i.d) cells from a different distribution with a probability density function p_sc(x). To make the different sample distributions consistent, this study estimate the density ratio $\beta \ ( x ) = \frac{{{p}_{sc}( x )}}{{{p}_{bulk}( x )}}\$ from the given finite samples x_bulk and x_sc.

Next, it transform them from its original high-dimensional space to a Reproducing Kernel Hilbert Space (RKHS) by mapping $\Phi ( x ):x \to \mathcal{F}$ . This study aims to minimize the weighted distribution β(x)p_bulk(x) and the distribution p_sc(x) in Maximum Mean Discrepancy (MMD)^[²⁸^] between the RKHS to estimate sample weights [Image Omitted. See PDF]

Using the average value of x_bulk and x_sc instead of the expected value, the new equation is: [Image Omitted. See PDF]

Among them, ${K}_{{x}_{bulk},{x}_{bulk}}$ and ${K}_{{x}_{bulk},{x}_{sc}}$ are two Gram kernel matrices, and ${1}_{{n}_{sc},1}$ is a vector of size n_sc × 1. [Image Omitted. See PDF][Image Omitted. See PDF][Image Omitted. See PDF]

This new equation is subject to two constraints. The first constraint bounds β_i between 0 and B, reflecting the range of differences between p_bulk(x) and p_sc(x). The second constraint $| {\frac{1}{{{n}_{bulk}}}\sum_{i\ = \ 1}^{{n}_{bulk}} {{\bf{\beta }}}_i - 1} | \le \epsilon$ is the normalization of β(x), because p_sc (x) = β(x)p_bulk(x) should approximate the probability density function. Small values of ε represent normalized precision. The parameter B was set to the default value of 1000, ε to $( {\sqrt {{n}_{bulk}} - 1} )/\sqrt {{n}_{bulk}}$ and the kernel bandwidth σ to the median of pairwise sample distances,^[²⁸^] which more detailed information on parameter selection can be found in.^[²⁹^] Solving Equation (2) requires solving a convex quadratic programming (QP) problem with linear constraints, which can be accomplished using any existing QP solver.^[³⁰^]

Construct Cell Type Signature Matrix

Single-cell transcriptome data were used to construct feature matrices for cell type deconvolution. Regression-based deconvolution methods typically formulate the deconvolution problem as a system of linear equations, Sx = t, where S is a n × k gene feature matrix (n is the number of genes, k is the number of cell types), and t is a vector of n × 1, representing the bulk RNA-seq data. The vector x is a k × 1 vector that represents the proportion of cell types, where n ≫ k, resulting in an overdetermined equation and has no exact solution. Most existing methods for solving the overdetermined equation Sx = t were based on the ordinary least squares (OLS) method, which involves minimizing the sum of squares of the absolute error.^[^8,12,31^] However, this approach can result in undesirable consequences, as it may not effectively take into account all informative genes. For instance, if the gene has a low average expression level, its contribution may be small even if its expression varies widely between various cell types. Therefore, the new approach for constructing a feature matrix, inspired by the single-cell differential marker gene approach was proposed.^[³²^] In this strategy, cell types were identified using cluster analysis, and the specificity contribution of each gene was calculated by the cosine similarity between genes for each cell type, and then averaging the gene expression levels for each gene expressed in all cells associated with the cell type were then averaged and multiplied with the contribution matrix, resulting in the cell type signature matrix S. The details were as follows.

To compare the expression patterns of two genes in a given cell cluster, this study can represent the expression of each gene using vectors in an n-dimensional cell space. In this space, a vector for each gene, where each vector consists of n bases, was created. Here, n represents the total number of detected cells in the population. The value of the vector was the expression level of a gene in all cells. After normalization, this study obtains the gene expression matrix X ∈ R^{N × M}, where N is the number of cells and M is the number of genes. For the i_th gene, g_i expressed in all cells is represented by the i_th column of X: ${{\bf{g}}}_i = [ { \def\eqcellsep{&}\begin{array}{@{}*{1}{c}@{}} {{x}_{1i}}\\ \vdots \\ {{x}_{ni}} \end{array} } ]\$ , where X_ji is the expression of g_i of the j_th cell c_j,j ∈ {1, …, N}. The variable K is typically used to denote the number of cell classes, which were either predefined through manual annotation or identified through unsupervised cell clustering. To create an ideal gene v_k that is expressed only in a given cell type C_k,k ∈ {1, …, K} and not in any other cell population, it set the ideal marker gene v_k in C_k: ${v}_k = [ { \def\eqcellsep{&}\begin{array}{@{}*{1}{c}@{}} {{v}_{1k}}\\ \vdots \\ {{v}_{nk}} \end{array} } ]\$ . Here, v_jk = 1 if c_j ∈ C_k and v_jk = 0 if c_j∉C_k

The cosine similarity between two genes in a scRNA-seq dataset was calculated as the cosine of the angle between their representation vectors in the cell space. A smaller angle indicates that the expression patterns of the two genes were more similar. If the expression patterns of the two genes were identical, the angle between their representative vectors will be zero, independent of their differences in abundance. To compare the representation vector (g_i,i ∈ {1, …, G}) of each gene with the ideal representation vector v_k, the cosine similarity between g_i and v_k, using the formula cos(g_i,v_k) was calculated [Image Omitted. See PDF]

To determine the gene-specific score of each gene in each cell type, this study aims to maximize the cosine similarity between the gene's representation vector and the ideal vector v_k of its corresponding cell type while minimizing the cosine similarity between the gene's representation vector and the ideal vector of other cell types. The $contribut{e}_{score( {{g}_i,{C}_k} )}$ for each was calculated gene as follows: [Image Omitted. See PDF]

Here, u ≥ 0 represents the degree of penalty imposed on a gene's representation vector when it matches the ideal vector of a non-target cell type (C_t, t ∈ {1, …, K} and t ≠ k). The value of u can be adjusted by the user, with a higher value of u indicating a greater punishment for genes that were expressed in non-target cell types. By default, u = 1 was set.

Cell Type Proportion Estimation

This section describe the connection between bulk and scRNA-seq gene expression. This relationship was essential to the deconvolution process, which this study estimates cell type proportions using the W-NNLS framework proposed by MuSiC.^[¹⁶^] After normalization, X_gc represents all the expression levels of gene g in cell c, and the total number of cells in cell type k is denoted as m^k = |C^k| . Thus, the total abundance of gene g is given by ${X}_g = \sum_{k = 1}^K \ \sum_{c \in {C}^k} {X}_{gc}$ and the total number of cells is denoted as $m\ = \sum_{k = 1}^K \ {m}^k$ . The fraction of cells from cell type k is given by ${p}^k = \frac{{{m}^k}}{m}\$ . the average abundance of gene g in cell type k could calculated using the following formula: [Image Omitted. See PDF][Image Omitted. See PDF]

The relative abundance of gene g in bulk tissue can be expressed as ${Y}_g = \frac{{{Y}_g}}{{\mathop \sum \nolimits_{g^{\prime} = 1}^G {Y}_{g^{\prime}}}}\$ , where Y = βx_bulk is the transformed gene expression in bulk tissues, and G is the total number of genes. Therefore, [Image Omitted. See PDF]

S is the feature matrix that calculated and is a constant.

To estimate ${p}_j = \{ {p_j^k,k = 1, \ldots ,K} \}\$ , it need to satisfy two constraints. The first constraint is non-negativity, which requires that $p_j^k \ge 0$ ,for all j and k. The second constraint is $\sum_{k\ = \ 1}^K \ p_j^k = \ 1$ ,for all j. Since the relationship between bulk and scRNA-seq expression profiles, as deduced in Equation (10), is a “proportional” relationship, this study needs a normalization constant to satisfy the second constraint.

Results and Discussion Performance on Simulated Data

To establish the validity of SCROAM, we performed in silico experiments using scRNA-seq data obtained from the Tabula Muris Senis Consortium.^[¹⁸^] We analyzed eight organs, each containing three to nine cell types, and generated “pseudobulks” by pooling reads from known proportions of labeled cells. This allowed us to evaluate the performance of SCROAM in comparison to recently published methods and Non-Negative Least Squares (NNLS). To evaluate the performance of the different methods, we calculated the L1 distance (absolute difference) between the estimated cell type proportions and the true proportions, and then divided the result by the number of cell types present in the sample. We observed that SCROAM had the lowest mean error among the evaluated methods in the reference/bulk configuration. (Figure 2A,B). Additionally, we visualized the aggregate error by organ (Figure 2C,D) and found that SCROAM was robust across different contexts. We directly compared the error of each method with that of SCROAM on the same deconvolution task, and found that SCROAM produced less error than the other methods in most cases. Although the nominal improvement of SCROAM in averaging L1 metrics per cell type initially appears small, it is important to note that most tissues are comprised of multiple cell types. Therefore, the overall error can accumulate quickly. Additionally, since the counts for each organ were obtained using both Smart-seq2 and 10x Chromium protocols, cross-protocol comparisons could be easily performed. This is especially important given the different techniques required to generate bulk and single-cell RNA-seq samples. Overall, the findings of this study indicate that SCROAM outperforms other existing methods, particularly in the Smart-seq2 reference and 10x Chromium pseudobulk configuration. Furthermore, the robustness of SCROAM across different organs and experimental protocols highlights its potential as a universal method for deconvolving cell type-specific gene expression from bulk RNA-seq data. For a detailed summary of our results, please refer to Tables S1 and S2 (Supporting Information).

View Image - Figure 2. The figure displays the error distribution for each method in the pseudobulk experiment utilizing data from the Tabula Muris Senis dataset. The experiment was conducted on eight distinct organs, and the errors were computed as the mean L1 error across various cell types in each organ. a,c) show the results for the Smart-seq2 reference and 10x Chromium pseudobulk. b,d) show the results for the 10x Chromium pseudobulk and Smart-seq2 pseudobulk. In the violin plots, the distribution of errors for each evaluated method is presented, with white dots indicating the mean error. The grid plots use colors to indicate the difference between the mean errors of the different methods in that organ, with darker reds indicating relatively poorer performance. These visualizations allow for easy comparison of the performance of different methods across different organs and experimental conditions.

Figure 2. The figure displays the error distribution for each method in the pseudobulk experiment utilizing data from the Tabula Muris Senis dataset. The experiment was conducted on eight distinct organs, and the errors were computed as the mean L1 error across various cell types in each organ. a,c) show the results for the Smart-seq2 reference and 10x Chromium pseudobulk. b,d) show the results for the 10x Chromium pseudobulk and Smart-seq2 pseudobulk. In the violin plots, the distribution of errors for each evaluated method is presented, with white dots indicating the mean error. The grid plots use colors to indicate the difference between the mean errors of the different methods in that organ, with darker reds indicating relatively poorer performance. These visualizations allow for easy comparison of the performance of different methods across different organs and experimental conditions.

Given the many factors that need to be considered, including different algorithms, tissues, cell types, and experimental protocols, any benchmark assessment of cell type deconvolution methods must include a large number of combinations. Furthermore, due to the randomness of the data makes it virtually impossible for any one algorithm to outperform the others in all cases, as discussed in Menden.^[¹⁴^] As a result, we suggest that evaluations should focus on a composite measure of accuracy across multiple cases.

In the data processing step of the SCROAM method, the KMM algorithm is utilized for data transformation. The performance of the deconvolution results can be improved or somewhat reduced depending on the dataset. If the gap between the single-cell reference and bulk data used is significant, the KMM algorithm can greatly enhance the performance. To better understand the effect of the data transformation step, we analyzed and displayed the Large Intestine organ dataset (with Smart-seq2 as the reference). As shown in Figure 3A, the data transformed by KMM yielded a lower error rate than the data without transformation. Furthermore, as shown in Figure 3B, we utilized the Jensen–Shannon Divergence (JSD)^[³³^] indicator to measure the distance between the two distributions. It can be observed that the distance between the bulk data and the cells in the single-cell reference was significantly smaller after data transformation compared to the original bulk data. In Figure 3C, we analyzed each sample of the Large Intestine organ and assessed the correlation between samples using the Pearson correlation coefficient (PCC).^[³⁴^] The results showed that applying deconvolution on the Large Intestine data improved the overall evaluation results. These findings suggest that data transformation using the KMM algorithm can significantly improve the performance of SCROAM, particularly when there is a large discrepancy between the single-cell reference and bulk data.

View Image - Figure 3. depicts the results of the Large Intestine organ dataset using Smart-seq2 as a reference. a) The comparison of results before and after data transformation is shown, indicating that the data transformed by KMM resulted in lower error rates. b) The distance between the raw bulk data and the single-cell reference is compared with the distance between the transformed data and the single-cell reference. The value in each box represents the JSD distance between the sample and the cell. The results show that the distance between the transformed data and single-cell reference is significantly smaller than that of the raw bulk data, highlighting the effectiveness of the KMM data transformation step. c) The deconvolution analysis results for each sample are presented, demonstrating that the results using transformed data are generally higher than those without transformation.

Figure 3. depicts the results of the Large Intestine organ dataset using Smart-seq2 as a reference. a) The comparison of results before and after data transformation is shown, indicating that the data transformed by KMM resulted in lower error rates. b) The distance between the raw bulk data and the single-cell reference is compared with the distance between the transformed data and the single-cell reference. The value in each box represents the JSD distance between the sample and the cell. The results show that the distance between the transformed data and single-cell reference is significantly smaller than that of the raw bulk data, highlighting the effectiveness of the KMM data transformation step. c) The deconvolution analysis results for each sample are presented, demonstrating that the results using transformed data are generally higher than those without transformation.

Validation with Real Bulk RNA-Seq Data

In rare cases, when experimental cell type proportions or bulk RNA-seq samples are available, it becomes feasible to evaluate the real-world performance of cell type deconvolution methods. In this study, we considered three such datasets. The first dataset was derived from the bulk RNA-seq mixture of two human breast cancer cell lines and fibroblasts (60% MDA-MB-468, 30% MCF-7, 10% fibroblasts). To validate our results, we utilized scRNA-seq data published by Dong (Figure 4A).^[¹⁷^] Figure 4B illustrates that SCROAM yielded highly precise results, displaying the lowest mean error out of all evaluated methods. In contrast, the other assessed methods significantly overestimated the bulk fraction of the MDA-MB-468, while underestimating the fraction of the MCF-7 cell line, with the exception of NNLS. These findings suggest that SCROAM is capable of accurately deconvolving cell type-specific gene expression from bulk RNA-seq data, even in challenging real-world scenarios, outperforming other commonly used methods.

View Image - Figure 4. The evaluation of each applicable method using data from Dong,[21] which includes known cell type proportions. a) shows the single-cell clustering results and t-SNE visualization of the three cell types in the dataset, MDA-MB-468, MCF-7, and normal fibroblasts, with a ratio of ≈6:3:1. b) The benchmark of deconvolution results for bulk RNA-seq samples generated by different methods is presented. The proportion estimated by SCROAM has the lowest Mean L1 errors (2.7) to the ground truth, indicating superior accuracy in estimating cell type proportions.

Figure 4. The evaluation of each applicable method using data from Dong,[21] which includes known cell type proportions. a) shows the single-cell clustering results and t-SNE visualization of the three cell types in the dataset, MDA-MB-468, MCF-7, and normal fibroblasts, with a ratio of ≈6:3:1. b) The benchmark of deconvolution results for bulk RNA-seq samples generated by different methods is presented. The proportion estimated by SCROAM has the lowest Mean L1 errors (2.7) to the ground truth, indicating superior accuracy in estimating cell type proportions.

To validate the accuracy of heterogeneous expression measurements in peripheral blood mononuclear cells (PBMCs), we utilized a larger dataset consisting of two sets of 12 bulk whole blood samples from Newman^[²⁰^] and Monaco,^[²²^] respectively. The cell types contained in these bulk samples are: B cells, CD4+ T cells, CD8+ T cells, monocytes, natural killer (NK) cells, and neutrophils granulocytes. For deconvolution analysis, we acquired two distinct scRNA-seq PBMC reference datasets. The first dataset was from Newman^[²⁰^] for Newman bulk tissue samples, while the second dataset was a combination of two publicly available reference sets from 10x Genomics for Monaco bulk samples. Notably, since neutrophils are challenging to measure at the single-cell level, resulting in their absence from the majority of the single-cell references. Here we refer to Erdmann–Pham's^[³⁵^] approach.

To account for their presence in bulk samples, we merged a publicly available scRNA-seq^[²¹^] dataset containing human neutrophils into the two single-cell references. To account for the limited number of cells in the Newman reference dataset, we undertook subsampling of the neutrophil data to 250 cells. In contrast, we sampled 1250 neutrophils for the reference set in the 10x Genomics PBMC reference dataset.

Our deconvolution analysis showed that SCROAM outperformed all other methods, based on the mean absolute deviation (L1 error) of the two analyses. Table 2 summarizes the results of the analysis. Overall, our findings validate the accuracy of heterogeneous expression measurements in PBMCs and demonstrate the superiority of SCROAM in accurately deconvoluting bulk RNA-seq data into its constituent cell types.

Table 2 Average L1 errors with PBMC data and ground-truth cell proportions obtained from cytometry.

	Estimated proportions [%]
Method	Aggregate	Newman dataset	Monaco dataset
SCROAM	14.2	14.7	13.7
SCDC	19.3	17.2	21.3
NNLS	25.2	27.7	22.7
Bisque	n/a	n/a	17.3
MuSiC	n/a	n/a	22.7

The average L1 errors of the two PBMC datasets are presented in the first two columns, while the last column summarizes the L1 errors of the two datasets. The best performing method is indicated in bold. It is worth noting that Bisque and MuSiC were unable to provide scale estimates for the Newman^[²⁰^] data due to the absence of multiple samples of all reference cell types.

Analysis of Mouse Spinal Cord Injury Data

In this study, the actual bulk RNA-seq data acquired from the neural stem region of the mouse spinal cord was analyzed using SCROAM in this study. We first performed scRNA-seq data and bulk RNA-seq data on the identical tissue and used the Seurat^[³⁶^] pipeline for single-cell clustering, t-SNE visualization of the resulting clusters is presented in Figure 5A (see the Experimental Section for details). Through cluster annotation analysis, we identified six distinct cell types: Dormant-Neural Stem Cells (dNSCs), Primed-Neural Stem Cells early (pNSCs_early), Primed-Neural Stem Cells late (pNSCs_late), Active Neural Stem Cells (aNSCs), Transit-Amplifying Progenitors (TAPs), and Neuroblast (NB).

View Image - Figure 5. Each applicable method was evaluated using data from the neural stem region of the mouse spinal cord. a) Following single-cell clustering, t-SNE visualization was generated, revealing six clusters: d_qNSCs, p_qNSCs_early, p_qNSCs_late, aNSCs, TAPs, NB. b) In the benchmarking of deconvolution results on bulk samples generated by different methods, SCROAM was observed to provide the most accurate estimation of the actual biological proportions among all the benchmarked methods.

Figure 5. Each applicable method was evaluated using data from the neural stem region of the mouse spinal cord. a) Following single-cell clustering, t-SNE visualization was generated, revealing six clusters: d_qNSCs, p_qNSCs_early, p_qNSCs_late, aNSCs, TAPs, NB. b) In the benchmarking of deconvolution results on bulk samples generated by different methods, SCROAM was observed to provide the most accurate estimation of the actual biological proportions among all the benchmarked methods.

Next, we employed annotated single-cell reference data for deconvoluting bulk RNA-seq samples. SCROAM largely follows the w-NNLS framework suggested by MuSiC,^[¹⁶^] but it varies in several respects. First, SCROAM reduces the difference between single-cell sequencing and bulk sequencing through KMM^[²⁷^] conversion. Second, when constructing a cell type-specific expression matrix, SCROAM determines specificity by calculating the cosine similarity between genes and calculating gene-specific scores so that genes with smaller scores contribute less to the composition estimate. On this real dataset, we demonstrated that the SCROAM was superior to other methods in accurately determining the proportions of the six cell types present in the bulk sample (Figure 5B). Additionally, our method's predictions for each cell type and their respective proportions align more closely with the biological context. The NNLS method only estimated two cell types, while the MuSiC and SCDC methods estimated three cell types. Although the Bisque method was able to deconvolute each cell type, the bulk sample used was from adult normal spinal cord cells, where there are more quiescent cells. The SCROAM approach better reflects this biological reality, demonstrating its utility in accurately deconvoluting similar cell type mixtures from bulk RNA-seq data, emphasizing its utility. These findings highlight the potential of SCROAM as a valuable tool for analyzing bulk RNA-seq data in various biological contexts.

Case Study1: Cell Fraction of HCC TME Correlates with Clinical Outcome

The aim of this study was to analyze the tumor microenvironment (TME) in liver cancer by estimating cellular components through transcription using two single-cell RNA-seq reference: 1) GSE115469^[³⁷^] (Normal Liver), and 2) GSE146409^[³⁸^] (TME-Stroma). GSE115469 represents a liver subgroup in the Human Cell Atlas, while GSE146409 includes the TME of liver cancers, such as Hepatocellular carcinoma (HCC) and cholangiocarcinoma (CCA), and describes over 20 cell types or subtypes. The expression matrices in both datasets are normalized to 10 000 counts per cell and are provided in H5AD file format, along with metadata annotations by the authors for downstream analysis.

Our study aimed to determine if changes in liver cells have an impact on the clinical prognosis of HCC. To achieve this, we conducted survival analysis using the TCGA-LIHC database, the most cited HCC research cohort containing a sizable sample of 370 HCC patients with well-annotated follow-up information. This database is a pooled study of five different cohorts with varying risk factors, and it includes survival analyses such as overall survival (OS).^[³⁹^] Among all estimated cell types, Liver sinusoidal endothelial cells (LSECs) and cholangiocytes had a significant impact on patient prognosis. A high fraction of LSEC cells was associated with longer OS (Figure 6A), while patients with a high proportion of cholangiocytes had a lower OS (Figure 6B). These findings are consistent with previous reports.^[⁴⁰^]

$View Image - Figure 6. Effect of cell ratio on patient survival. a) Effect of LSEC cell fraction on overall survival (OS), with patients exhibiting high levels of LSEC cells having longer survival times. b) The effect of cholangiocyte fraction on OS, with patients having a high proportion of cholangiocytes associated with a lower OS.$

Figure 6. Effect of cell ratio on patient survival. a) Effect of LSEC cell fraction on overall survival (OS), with patients exhibiting high levels of LSEC cells having longer survival times. b) The effect of cholangiocyte fraction on OS, with patients having a high proportion of cholangiocytes associated with a lower OS.

In conclusion, our results suggest that both LSECs and cholangiocytes are crucial in determining the clinical prognosis of HCC. The findings of this study have important implications for understanding the biology of liver cancer and developing new therapeutic strategies.

Case Study2: Tumor Microenvironment Infltration Estimation

In this study, we analyzed cell type proportions in 169 glioblastoma (GBM) samples obtained from TCGA.^[⁴¹^] To ensure the accuracy of our results, we employed scRNA-seq data obtained from the same tumor type to perform the deconvolution task.^[⁴²^] Using these reference sets, we identified six cell types of GBM (Figure 7A).

$View Image - Figure 7. Relationship between cell status and prognosis of non-malignant cells in various tumor types from the TCGA cohort. a) Violin plot visualizing the distribution of cell type fractions in each tumor type. The median is represented by a white dot and the upper and lower quartiles are represented by bars. b,c) The association between oligodendrocyte(b) and pericyte(c) infiltration with survival in GBM using Kaplan–Meier plots.$

Figure 7. Relationship between cell status and prognosis of non-malignant cells in various tumor types from the TCGA cohort. a) Violin plot visualizing the distribution of cell type fractions in each tumor type. The median is represented by a white dot and the upper and lower quartiles are represented by bars. b,c) The association between oligodendrocyte(b) and pericyte(c) infiltration with survival in GBM using Kaplan–Meier plots.

We also investigated whether there was a correlation between non-malignant cell types and patient survival. Since the TCGA cohort samples varied substantially with respect to treatment, genetic drivers, and other confounding factors, we excluded samples with genetic or clinical covariates, such as IDH-mutant GBMs, which have well-documented and substantial effects on prognosis. To assess the relationship between cell type abundance and patient survival, we utilized two Cox proportional hazards models. To stratify the samples based on cell type abundance, we utilized the cutoff package^[⁴³^] which allowed us to classify the samples into high and low abundance groups. To assess the relationship between cell type and patient survival, we used cell type abundance as a variable value in our analysis.

Our analysis revealed some important relationships between various immune cell types and clinical outcomes. Our study of GBM revealed that oligodendrocytes(Oligo) and pericyte cells exhibited a stronger correlation with patient survival rate (Figure 7B,C), which is consistent with previous reports.^[⁴⁴^] The results of our study emphasize the significance of comprehensively understanding the TME of GBM and may have implications for the development of novel therapeutic strategies targeting non-malignant cell types. Overall, our study demonstrates the utility of SCROAM in analyzing bulk RNA-seq to unravel the complex cellular composition of tumors and identify potential prognostic biomarkers.

Conclusion

Accurate identification of cellular composition in complex tissues is crucial for understanding disease pathogenesis, early diagnosis, and prevention. The high cost and technical noise associated with scRNA-seq can limit its widespread use, despite its ability to offer valuable insights into cellular composition. Consequently, cost-effective alternatives to deconvolution of bulk RNA-seq are necessary. However, most statistical and computational methods have been developed for cell-type deconvolution of bulk RNA-seq, many of these methods have limitations. For example, some methods necessitate prior knowledge of the gene expression profile or cell type composition of purified cell types. There are also methods that rely on a pre-selected list of marker genes. Moreover, fully unsupervised deconvolution methods based on non-negative matrix factorization may suffer from several limitations, such as low precision, discriminability, and multicollinearity issues. Although cell type decomposition on bulk RNA-seq data using scRNA-seq references is an appealing analysis strategy, it is important to note that there may be significant differences in data distribution between scRNA-seq and bulk RNA-seq data, and unreliable feature matrices generated from single-cell references can lead to the identification of incorrect cell types.

To address these challenges and make better use of existing data, we propose SCROAM, a method that utilizes single-cell references and domain-adaptive matching to deconvolute bulk RNA-seq data. By utilizing scRNA-seq references to create reference cell type expression profiles, we eliminate the need for marker gene selection, thereby improving computational efficiency. We calculate the similarity between the true and ideal expression of each gene in the cellular dimension as its specific contribution. Additionally, by reducing the dimensionality of scRNA-seq and bulk RNA-seq to a common feature space, we use domain adaptation matching to learn the difference between the two data in the latent space. After continuous iteration, we learn a weight that eliminates the latent space difference. When significant technical disparities exist between single-cell reference data and observed bulk data, the performance of decomposition can be significantly improved.

To evaluate our method, we benchmarked existing methods using generated pseudo-batch datasets and real experimental datasets with known cell type composition as ground truth. Our results show that SCROAM outperforms existing methods in both settings, providing a more accurate breakdown of cell types. We also applied SCROAM to a mouse spinal cord dataset, where our method successfully identified continuous cell types consistent with biological facts. This analysis further demonstrates the robustness and accuracy of our method in deconvoluting complex tissue samples into their constituent cell types. Finally, we validated our method on liver cancer and tumor datasets and discovered cell types associated with patient prognosis, highlighting the potential of deconvolution to reveal cellular heterogeneity in complex biological systems. Our study demonstrates the versatility of SCROAM and its potential in addressing diverse biological questions of interest. Our approach could provide valuable insights into changes in cell proportions in diseased tissues, informing subsequent therapeutic strategies.

Acknowledgements

This research was funded by National Natural Science Foundation of China, grant number 62072353 and 62272065. L.Y. and C.Z. initiated and envisioned the study. X.G., Z.H., and L.Y. formulated the model. X.G. was responsible for implementing the algorithm and performing simulation studies in this research. Z.H and F.J. performed a baseline comparison of human breast cancer and peripheral blood. C.Z. and F.J. conceived and carried out the biological experiments. All authors participated in the real data analysis presented in this paper. X.G. and L.Y. were responsible for writing the manuscript, which was subsequently reviewed, edited, and approved by all authors.

Conflict of Interest

The authors declare no conflict of interest.

Data Availability Statement

The data that support the findings of this study are available in the supplementary material of this article.;

Word count: 6999

Show less

© 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

Accurately identifies the cellular composition of complex tissues, which is critical for understanding disease pathogenesis, early diagnosis, and prevention. However, current methods for deconvoluting bulk RNA sequencing (RNA-seq) typically rely on matched single-cell RNA sequencing (scRNA-seq) as a reference, which can be limiting due to differences in sequencing distribution and the potential for invalid information from single-cell references. Hence, a novel computational method named SCROAM is introduced to address these challenges. SCROAM transforms scRNA-seq and bulk RNA-seq into a shared feature space, effectively eliminating distributional differences in the latent space. Subsequently, cell-type-specific expression matrices are generated from the scRNA-seq data, facilitating the precise identification of cell types within bulk tissues. The performance of SCROAM is assessed through benchmarking against simulated and real datasets, demonstrating its accuracy and robustness. To further validate SCROAM's performance, single-cell and bulk RNA-seq experiments are conducted on mouse spinal cord tissue, with SCROAM applied to identify cell types in bulk tissue. Results indicate that SCROAM is a highly effective tool for identifying similar cell types. An integrated analysis of liver cancer and primary glioblastoma is then performed. Overall, this research offers a novel perspective for delivering precise insights into disease pathogenesis and potential therapeutic strategies.

Details

Title

Highly Accurate Estimation of Cell Type Abundance in Bulk Tissues Based on Single-Cell Reference and Domain Adaptive Matching

Author

Guo, Xinyang¹; Huang, Zhaoyang¹; Ju, Fen²; Zhao, Chenguang²; Yu, Liang¹

¹ School of Computer Science and Technology, Xidian University, Xi'an, China
² Department of Rehabilitation Medicine, Xijing Hospital, Fourth Military Medical University, Xi'an, China

Section

Research Articles

Publication year

2024

Publication date

Feb 2024

Publisher

John Wiley & Sons, Inc.

e-ISSN

21983844

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1002/advs.202306329

ProQuest document ID

2927058306

Highly Accurate Estimation of Cell Type Abundance in Bulk Tissues Based on Single-Cell Reference and Domain Adaptive Matching

Jump to:

Full Text

Abstract

Details

Suggested sources