INTRODUCTION
A strong link between gut microbial dysbiosis and colorectal cancer (CRC) has been repeatedly reported in the last years (1, 2). The term “oncobiome” has been used to describe this close relationship between the microbiome and cancer. In particular, the oncobiome relates to microbes that can cause cell transformation by affecting genome stability, resistance to cell death, and proliferative signaling (3). Specific microbial signatures have been associated with CRC (2, 4–6). Besides bacteria such as
Importantly, host and gut microbiome interactions are mediated by proteins, metabolites, and small RNAs (sRNAs), including
Microbial sRNAs play a critical role in microbial gene regulatory networks (16), but thus far, their involvement in host-microbiome interactions has been less extensively explored. In addition to the human sRNAs which can affect microbial gene expression (9), these molecules may contribute to an interkingdom RNA regulatory network by modulating microbial and human gene expression levels (17, 18). Many microbial RNA sequences can be detected in transcriptomic experiments performed on human samples (19, 20). However, their characterization remains challenging because the short length of sRNA transcripts (50 to 200 nucleotides [nt]) hampers their identification and quantification, especially when the microbial genomic sequence, usable as a reference, is not available. In this sense, the integration of small RNA sequencing (sRNA-Seq) and whole-metagenome sequencing profiles (21) could make the identification of microbial sRNAs more reliable and informative in searches of complementary cancer biomarkers.
In the present study, we sequenced stool samples collected from 80 subjects (24 healthy individuals, 27 adenomas, and 29 CRC) in a hospital-based case-control study. By integrating sRNA-Seq and metagenomic analyses, we provided evidence of extensive overlap of microbial populations detected by the two approaches with the identification of abundant microbial sRNAs in carcinoma patients. Finally, we reported a combination of DNA and RNA signatures which is able to accurately classify CRC patients with respect to precancerous lesions and healthy controls.
RESULTS
Differences in fecal microbiome composition among healthy, adenoma, and CRC patients.
The study included 80 subjects who provided stool samples before undergoing colonoscopy. DNA and RNA fractions were extracted from each fecal sample for metagenomic and sRNA-Seq (5, 22) (see Materials and Methods).
The taxonomic characterization of the microbiome revealed 427 bacterial species belonging to 11 phyla (see Table S1A [https://doi.org/10.6084/m9.figshare.8078630]). The most abundant bacterial species in both healthy and adenoma samples was
At the phylum level, the abundances of Proteobacteria and Verrucomicrobia in adenoma patients were significantly different from those seen in both the healthy and cancer groups. In particular, Verrucomicrobia showed the highest abundance whereas Proteobacteria showed intermediate abundance in the adenoma group (see Table S1A). Firmicutes was the most significantly abundant phylum in the carcinoma group compared with both the healthy and adenoma groups (Fig. 1A; see also Table S1A).
FIG 1
(A) Violin plots reporting the relative abundances of differentially abundant bacterial phyla among healthy, adenoma, and carcinoma groups from metagenomic data analysis. P values were computed using the Wilcoxon rank sum test and adjusted using the Benjamini-Hochberg method. **, adjusted P value < 0.01; *, adjusted P value < 0.05. (B) Violin plots reporting the relative abundances of
At the species level, a significant increase in the levels of
The spectrum of fecal human sncRNAs.
The analysis of human and microbial sRNA expression in stool samples was performed by applying a modified sRNA-Seq analysis pipeline developed by our group (22, 23). Starting from an average of 9.3 million single-end reads per sample, a median of 5.3% of the reads were assigned to human sRNA annotations (Fig. 1C; see also Table S2 [https://doi.org/10.6084/m9.figshare.8078633]).
Collectively, 141 hsa-miRNAs and 973 hsa-sncRNAs were associated with at least one sample group (with a median of at least 20 reads). Overall, comparing healthy to adenoma subjects, adenoma to carcinoma subjects, and healthy to carcinoma subjects, 14, 54, and 87 differentially expressed hsa-miRNAs were identified, respectively (see Fig. S1A [https://doi.org/10.6084/m9.figshare.8081003] and Table S3A [https://doi.org/10.6084/m9.figshare.8078627]). Furthermore, 70, 47, and 4 hsa-sncRNAs (other than hsa-miRNAs) were dysregulated in the comparisons described above, respectively (see Fig. S1B and Table S3B). Among the differentially expressed sncRNAs, tRNAs were the most highly represented biotype (55.4%), followed by Piwi-interacting RNAs (piRNAs) (31.3%).
The hsa-miRNA and hsa-sncRNA found to be most significantly differently expressed in the comparison between cancer and healthy groups were hsa-miR-378a-3p (adjusted P value, <0.0001) and hsa-piR-11481 (adjusted P value, <0.02), respectively (Fig. 1D).
Focusing on hsa-miRNAs only, two annotations (hsa-miR-200b-3p and hsa-miR-6738-5p) were detected as significantly dysregulated in all the comparisons (see Fig. S1C). The hsa-miR-200b-3p expression levels progressively increased whereas those of hsa-miR-6738-5p gradually decreased going from the healthy to the adenoma to the CRC group (see Fig. S1D).
In addition to the two hsa-miRNAs that were found significantly altered in all the comparisons, in the adenoma group, we also identified hsa-miR-30c-5p, which was the hsa-miRNA most highly differentially expressed in this group compared with the other two. In contrast, 43 differentially expressed hsa-miRNAs characterized the CRC group. A total of 5 and 33 altered hsa-miRNAs showed expression trends of gradual increases and decreases, respectively. Finally, five differentially expressed hsa-miRNAs were characterized by the highest expression in the adenoma samples and showed slightly decreased expression in the cancer group (see Table S3C).
Microbial sRNAs in human stool samples.
All the reads previously not aligned to hsa-miRNAs and human sRNAs were further mapped against the human genome to identify those derived from human RNAs. On average, 61.9% of the input reads did not align to any human annotation (see Table S2). A tiny fraction of the remaining reads (average proportion of mapped reads = 0.004%) were aligned against miRNAs derived from animals or plants which can be commonly found in the Western diet. Their representation did not differ among the healthy, adenoma, and CRC subjects (average proportions of mapped reads, 0.004%, 0.004%, and 0.005%, respectively); therefore, we did not proceed further with the analyses in this direction.
Considering annotations of bacteria, archaea, and viruses in NCBI, an average of 18.93% of nonmapped human reads were aligned to microbial genome sequences, with the highest prevalence of reads assigned to bacteria (on average, 99.96% of the assigned reads) followed by archaea (average, 0.03%) and viruses (average, 0.03%) (see Table S2). Thus, we focused our analysis on the reads assigned to bacteria genomes to define a bacterial sRNA (bsRNA)-based profile (Fig. 1C; see also Table S1A to C).
The bsRNA reads were used to perform two types of analysis (Fig. 2A). The first analysis was based on the refinement of the bacterial profile obtained from the metagenomic data and the second on the identification of the differentially expressed bsRNAs. The information provided by the metagenomic experiments was integrated with that provided by the analysis of bsRNAs from sRNA-Seq data to improve the profiling of the bacteria. The consistency of the bsRNA-based profiles was verified by comparing the relative abundances of bacteria at each taxonomic level obtained from the metagenomic analysis (Fig. 2B) (see also Fig. S2A and B [https://doi.org/10.6084/m9.figshare.8080997] and Table S4A and B [https://doi.org/10.6084/m9.figshare.8080988] and Materials and Methods for more details). Considering all taxonomic levels, 50.2% of the annotations (n = 130) detected from DNA and RNA were significantly correlated (adjusted P value, <0.05) and all of them were associated with a positive correlation (median r = 0.64) (see Table S4A).
FIG 2
(A) Flow chart summarizing the analyses performed to identify and analyze the sRNA-Seq reads assigned to microbial genomes and bsRNAs. WMS, whole-metagenome sequencing. (B) Stacked bar plots reporting the relative abundances of bacterial phyla detected using whole-metagenome sequencing (top) and small RNA sequencing (sRNA-Seq; bottom) data, respectively. (C) Bar plot reporting the Pearson correlation coefficient (r) computed from comparisons between metagenomic and sRNA-Seq data for each phylum. (D) Heat map reporting the log2 ratios of relative abundances between bsRNAs and bacterial DNA profiles. Only ratios of bacterial species with median greater than 1 are reported. P.aeu,
At the species level,
Considering the relative abundances from the bsRNA profiles, 15 species were significantly altered in the CRC group compared to the healthy group (see Table S1B). We confirmed that
The analysis of the content of local secondary structures in the reads of our sRNA-Seq data set showed that the reads assigned to bacteria had an overall lower minimum free energy (MFE) level than those assigned to humans (P value, <0.0001) (see Table S5A and B [https://doi.org/10.6084/m9.figshare.8080976]). That outcome is consistent with previous reports and observations of an increase in local secondary structures in bacterial RNAs (24). To further evaluate the nature of the bsRNAs detected by us, we quantified 21,088 nonredundant bsRNA annotations from RNA Central v12 (RNA Central Consortium 2018) (see Table S6A [https://doi.org/10.6084/m9.figshare.8080985]). Among them, 88.50% (n = 18,664) were annotated in RNA Central to only one bacterial species, particularly to
We identified 450 differentially expressed bsRNAs (adjusted P value < 0.05, median reads > 20) among the groups, and the highest number of bsRNAs (n = 419) was detected in comparison of the healthy and carcinoma groups (see Table S6B). The majority (n = 176, 42.0%) of these sRNAs were annotated to
At this stage, we have verified whether other species-specific bsRNAs were characterized by dysregulated expression levels between sample groups. To reach this scope, we quantified the expression levels of bsRNAs annotated in the Bacteria Small RNA Database (BSRD) (25) and performed differential expression analyses among the groups. We identified 18 bsRNAs that were differentially expressed between the CRC and healthy control groups (Fig. 2D; see also Table S6C).
To complete the bsRNA analysis, we computed the relative abundances (ratios) of DNA and bsRNA (Fig. 2E; see also Table S4D). Fourteen species, mainly from the Bacteroidetes phylum, were characterized by a high transcription rate (high bsRNA/DNA ratio). Conversely,
Specific signaling pathways targeted by hsa-miRNAs are correlated with
Given the observed increased abundance of
FIG 3
(A) Bar plot reporting the list of hsa-miRNAs and hsa-sncRNAs correlated with the abundance of
To investigate the role of the 13 identified hsa-miRNAs in more detail, we performed a functional enrichment analysis of the human genes targeted by those hsa-miRNAs using miRPathDB (26). As expected, the most highly represented KEGG term was “microRNA in cancer,” which was enriched in the targets of six differentially expressed hsa-miRNAs (Fig. 3B). Interestingly, three of them (hsa-miR-423-5p, hsa-miR-92a-3p, and hsa-let-7b-5p) were also annotated to the KEGG term “Pathogenic
Using the WikiPathways annotations, the terms “miRNA Regulation of DNA Damage Response” and “miRNAs involved in DNA damage response” also emerged as statistically enriched in four other differentially expressed hsa-miRNAs (i.e., hsa-let-7a-5p, hsa-let-7b-5p, hsa-miR-26a-5p, and hsa-miR-7g-5p) correlated with the
The combined use of human and microbial sRNAs improves classification.
Finally, we tested a combination of transcriptomic and genomic profiles to classify the recruited subjects according to disease status. The Random Forest classification approach (4) provided high accuracy in classifying CRC cases and controls in unbiased cross-validation (area under the curve [AUC] = 0.86) in considering the profiles of both human and bacterial sRNA (hsa-miRNAs + bsRNAs) (see Fig. S4A and B [https://doi.org/10.6084/m9.figshare.8081006]). Conversely, the use of microbial DNA profiles alone showed lower potential with respect to classifying samples (AUC = 0.65). Note that the use of hsa-miRNA, bsRNA, and microbial DNA profiles provided the best classification performance for CRC versus controls (AUC = 0.87) and CRC versus adenomas (AUC = 0.74) (Fig. 4A). However, this signature was not able to distinguish between adenoma and controls samples (AUC = ∼0.5).
FIG 4
(A) Heat map reporting the area under the curve (AUC) computed by the Random Forest classifier using bacterial relative abundances provided by metagenomic data (bDNA), sRNA-Seq data (bsRNAs), and the combination of both bDNA and bsRNAs and combined with the expression levels of hsa-miRNAs (hsa-miRNAs + bDNA + bsRNAs). (B) Line plot reporting the AUC computed by the Random Forest classifier. For the classification of carcinoma and healthy samples, a specific number of features from different input information is reported. Rankings are obtained excluding testing set to avoid overfitting issues. (C) Bar plot reporting the average classification contribution of each of the 32 features providing the best classification accuracy of cancer and healthy samples. All the reported bacterial features were obtained using sRNA-Seq.
A signature of 32 features that included both hsa-miRNAs and bsRNA profiles from sRNA-Seq data provided the best differentiation of CRC subjects from healthy controls (Fig. 4B and C). The signature was composed of 57.7% human miRNAs and 42.4% microbial signals (Fig. 4C).
DISCUSSION
The reproducibility of CRC-associated gut microbiome signatures among different populations may allow the development of new accurate oncobiome-based diagnostic models, as recently demonstrated by us (5, 6). However, a more accurate patient stratification may be obtained, considering not only the microbiome composition based on DNA but also the other biological molecules that can be retrieved in stool samples. In this study, we investigated the composition of human and microbial sRNAs in fecal samples from healthy subjects and patients with colorectal adenoma or carcinoma and we compared the different profiles among these groups. We showed that bsRNA profiles reflect the differences in the microbial DNA profiles characterizing each group. Moreover, our study demonstrated that a specific signature composed of profiles of human miRNAs, microbial sRNAs, and microbial DNAs was able to accurately classify the three groups of subjects with a high level of performance (AUC = 0.87), evidence that would be worth confirming across multiple cohorts.
In our analysis, the Firmicutes abundances characterizing cancer samples were significantly different from those characterizing either the healthy or the adenoma group. Interestingly, the Verrucomicrobia phylum, characterized by a significant peak of expression in the adenoma samples, may represent a potential candidate biomarker for precancer lesions. A significant increase in the level of the Fusobacteria phylum was also observed in the carcinoma group compared to the healthy and adenoma groups, as previously reported (5–7, 28–30). The abundance of
In recent years, researchers have focused on profiling microbial DNA or RNA in stool samples, marginally considering the microbial small RNA counterpart. In this respect, we have analyzed the nonmapped human reads from small RNA-Seq of stool samples in relation to metagenomic profiling. We observed that the microbial transcriptomic profiling was highly concordant with the metagenomic data, supporting the idea of combined use of the two deep-sequencing techniques. However, the current lack of an exhaustive catalog of sRNA annotations in nonmodel organisms forced us to restrict our microbial sRNA quantification only to a subset of well-known microbes. In fact, the majority of the identified bsRNAs differentially expressed between the subject groups were annotated to
Our results showed that many
Interestingly, a subset of genes targeted by upregulated hsa-miRNAs in CRC subjects codes for proteins interacting with specific
Evidence of a role of hsa-miRNAs in the modulation of the immune response to specific pathogens was reported previously (10), and an altered form of hsa-miRNA expression that was detected in in vivo and in vitro models of
Another functional annotation enriched in the gene targets of hsa-miRNAs correlated with
The differential expression analysis suggested hsa-miR-30-5p as a candidate biomarker for adenomas given its peak of expression in this group. Similarly, the five hsa-miRNAs significantly upregulated in the CRC group compared to the adenoma and healthy groups (hsa-miR-21-5p, hsa-miR-200b-3p, hsa-miR-1290-5p, hsa-miR-4792-3p, and hsa-miR-1246-3p) could be considered promising CRC biomarkers.
Another critical aspect to consider is whether stool data reflect the sRNA expression profile and microbiome composition differences between the CRC primary tissue and the adjacent normal colonic mucosa. We compared the 19 hsa-miRNAs belonging to our signature defined by the Random Forest classifier with seven publicly available data sets of hsa-miRNAs differentially expressed between CRC tissues and matched adjacent colonic mucosa (see Table S8A [https://doi.org/10.6084/m9.figshare.8080979]). All the overlaps were statistically significant and involved from 4 to 12 hsa-miRNAs of our signature. Hsa-miR-21-5p was the only example from our signature detected as differentially expressed in all the compared data sets, followed by hsa-miR-378a-3p (differentially expressed in six data sets) and hsa-miR-215-5p (differentially expressed in five data sets) (see Table S8B). Considering also public annotations of hsa-miRNAs dysregulated in CRC primary tissues (miRCancer database), the hairpin sequences of 13 mature hsa-miRNAs belonging to the Random Forest signature were detected as annotated to CRC disease in this database (see Table S8C).
Yuan and colleagues performed both miRNA profiling and 16S RNA profiling of 44 CRC and paired healthy tissues (12). The comparison of these data with the hsa-miRNAs of our signature shows an overlap of four hsa-miRNAs (hsa-miR-21-5p, hsa-miR-148a-3p, hsa-miR-378a-3p, and hsa-miR-98-5p) detected as dysregulated in both studies (see Table S8A and B). However, in contrast to the study of Yuan and colleagues, we found a significant increase in
The results presented here took advantage of the use of dual pools of information provided by the concomitant DNA and sRNA sequencing of the same stool samples. This experimental setting provided us an overview of the microbial population and their activities. Considering the CRC microbiome profiles annotated in the Disbiome database (53), we obtained evidence of a relation with CRC for six species characterized by high metagenomic/sRNA-Seq correlation (
Although microbial genomic sequences are becoming exhaustively available, thanks to the diffusion of metagenomic analyses, a comprehensive database of microbial annotations is still needed at the RNA level. In this respect, we aligned the sRNA-Seq reads first to microbial genomes and, subsequently, to RNA Central annotations or BSRD annotations separately. It is important to highlight that only specific microbial species, particularly strains of
From the clinical point of view, our analysis provided further support for the idea of the use of a combination of fecal microbial and human RNA biomarkers to better distinguish subjects with colonic adenoma or carcinoma from healthy individuals. The isolation of RNA from stool samples can provide combinatorial information on the small RNA expression profile of the host and the gut microbiome that can be used for accurate classification of the subjects according to their health status. Validation on a larger independent cohort of patients and healthy controls is mandatory to assess the accuracy of these biomarkers. However, to the best of our knowledge, at the moment, there is no similar study/cohort available that collected RNA and DNA from the same stool samples and performed concomitant sequencing analyses. Our integrative analyses provided further evidence of an altered host-gut microbiome relationship in cancer and suggested an unexplored role of bsRNAs in CRC. Furthermore, the characterization of gut microbiome sRNA molecules could provide a novel opportunity to improve CRC treatments in light of the role of microbiota in the modulation of the patient response to cancer chemotherapy and immunotherapy (56, 57).
MATERIALS AND METHODS
Study population.
Samples were collected from patients recruited in a hospital-based study at the Clinica S. Rita in Vercelli, Italy. On the basis of colonoscopy results, participants were classified into three categories: (i) healthy subjects (individuals with colonoscopy results negative for tumor, adenomas, and inflammatory bowel disease [IBD]); (ii) adenoma patients (individuals with a colorectal adenoma[s]); and (iii) colorectal cancer patients (individuals with newly diagnosed CRC). A total of 80 subjects (29 CRC patients, 27 adenoma, and 24 controls) were included in the present study. CRC patients were recruited at the first CRC diagnosis and had not received any treatment before the fecal sample collection.
The study was approved by the local ethics committee (ethics committee of Azienda Ospedaliera SS. Antonio e Biagio e C. Arrigo of Alessandria, Italy; protocol N. Colorectal miRNA CEC2014), and informed consent was obtained from all participants.
Sample collection.
Naturally evacuated fecal samples were obtained from all patients previously instructed to self-collect the specimen at home before any bowel preparation for colonoscopy. The stool was collected in stool nucleic acid collection and transport tubes with RNA stabilizing solution (Norgen Biotek Corp.) and returned at the time of performing a colonoscopy in the endoscopy unit or at the time of blood sampling. Aliquots (200 ml) of the stool samples were stored at –80°C until RNA/DNA extraction.
Extraction of total RNA and total DNA from stool.
RNA was extracted using a stool total RNA purification kit (Norgen Biotek Corp.). The RNA quality and quantity were verified according to the MIQE guidelines (http://miqe.gene-quantification.info/). The RNA concentration was quantified by Qubit with a Qubit microRNA assay kit (Invitrogen). The DNA extraction was performed with a QIAamp DNA stool minikit (Qiagen, Hilden, Germany) according to the instructions of the manufacturer. Finally, DNA was eluted in 100 μl of the elution buffer provided with the kit. The DNA quantification was performed with a Qubit DNA high-sensitivity (HS) assay kit (Invitrogen).
Analysis of microbiome composition by shotgun sequencing.
Sequencing libraries were prepared using a Nextera XT DNA library preparation kit (Illumina, CA, USA), following the manufacturer’s guidelines and a previously reported protocol (22, 23). Sequencing was performed on a HiSeq 2500 sequencer (Illumina, CA, USA) at the internal sequencing facility of the Centre for Integrative Biology, Trento, Italy. Whole-metagenome sequencing data were analyzed as described previously (5). Analysis of differential abundances between subject groups was performed using the Wilcoxon rank sum test. Only data from species with average abundances higher than 0.1% in at least one group and with Benjamini-Hochberg (BH)-adjusted P values of <0.05 were considered statistically significant.
Library preparation for small RNA sequencing (sRNA-Seq).
sRNA transcripts were converted into barcoded cDNA libraries. Library preparation was performed with a NEBNext multiplex small RNA library prep set for Illumina (protocol E7330; New England BioLabs Inc., USA) as described previously (22). For each sample, 250 ng of RNA was used as the starting material to prepare libraries. Each library was prepared with a unique indexed primer so that the libraries could all be pooled into one sequencing lane. Multiplex adapter ligations, reverse transcription primer hybridization, reverse transcription reactions, and the PCR amplification were performed as described in the protocol provided by the manufacturer. After PCR preamplification, the cDNA constructs were purified with a QIAQuick PCR purification kit (Qiagen, Germany) following the modifications suggested in the NEBNext multiplex small RNA library prep protocol. Further quality control checks and size selections were performed following the NEBNext multiplex small RNA library prep protocol (protocol E7330; New England BioLabs Inc., USA). Size selection of the amplified cDNA constructs was performed using Novex Tris-borate-EDTA (TBE) gels (Invitrogen) (6%) and following the procedure of gel electrophoresis running and purification of the construct described in the Illumina TruSeq small RNA library prep protocol. The 140-nt and 150-nt bands correspond to adapter-ligated constructs derived from RNA fragments of 21 to 30 nt. A concluding Bioanalyzer 2100 run performed with a high-sensitivity DNA kit (Agilent Technologies, Germany) permitted checking final size, purity, and concentration for the sequences in the DNA libraries.
The obtained libraries (24 samples were multiplexed) were subjected to the Illumina sequencing pipeline, passing through clonal cluster generation on a single-read flow cell (Illumina Inc., USA) by bridge amplification on a cBot (TruSeq SR cluster kit, v3-cBOT-HS; Illumina Inc., USA) and 50 cycles of sequencing by synthesis using a HiSeq 2000 sequencer (Illumina Inc., USA) (in collaboration with Genecore Facility at EMBL, Heidelberg, Germany).
Analysis of sRNA from sRNA-seq data.
sRNA-Seq pipeline analyses were performed using a previously described approach (22, 23). Fastq files were quality checked (QC) using FastQC software (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Reads shorter than 14 nt were discarded. The QC-passed reads were clipped from the adapter sequences using Cutadapt (58) by imposing a maximum error rate in terms of mismatches, insertions, and deletions equal to 0.15. The length of the raw sRNA-Seq reads was 50 bp, and the reads were, on average, 24.8 bp in length after adapter removal. Trimmed reads were mapped against an in-house reference of human sRNA sequences composed of (i) 1,881 precursor hsa-miRNAs from miRBase v21 (59); (ii) 32,826 hsa-piRNA sequences from piRBase v1.0 (60); and (iii) 5,171 hsa-sncRNA sequences shorter than 80 bp from Database of Small Human noncoding RNAs (DASHR) database v1.0 (61). The alignment was performed using BWA algorithm v. 0.7.12 (62) with the default settings. Using these settings, the seeding was not performed for reads shorter than 32 bp, and the reads were entirely evaluated for the alignment. The hsa-miRNAs were annotated and quantified using two methods called the “knowledge-based” and “position-based” methods. In the “knowledge-based” method, the arm starting positions of the hsa-miRNAs precursor were well defined. On the basis of the position of the mapped reads, it was possible to quantify the mature hsa-miRNAs. Alternatively, the “position-based” method was implemented for those hsa-miRNA precursors in which the positions of the 5′ and 3′ arms are not defined. With the second method, on the basis of the position of the read, the name of hsa-miRNAs was assigned a “5′ Novel” or “3′ Novel” suffix and the data were quantified. The results generated by the annotation and quantification methods were merged into a unique mature hsa-miRNA count matrix.
The quantification of hsa-sncRNAs annotations was performed by counting the read alignment reported by BWA output sam files. The reads that were left unmapped in the annotations of the hsa-miRNAs and hsa-sncRNAs were aligned against the human hg38 genome using BWA with the default settings. The reads that were left unmapped on the human genome were then aligned against annotations of bacterial genomes and sRNAs.
Differential expression analysis was performed with DESeq2 R package v.1.22.2 (63) using the likelihood ratio test (LRT) function. This function was selected in order to correct the analysis, including age and gender as covariates. A gene was defined as differentially expressed if associated with a BH-adjusted P value of less than 0.05 and supported by at least a median number of reads higher than 20 within at least one of the sample groups considered.
Reads that were left unmapped to the human genome were considered for the quantification of annotations of nonhuman sRNAs. Using these data, quantification of diet-related miRNAs was performed by considering 5,293 miRNAs annotated to animal or plant species which are part of a Western diet and may be retrievable in human stool. The annotation of cancer-related hsa-miRNAs retrieved from the miRCancer database (64) was performed using the version from 18 February 2019. Only annotations obtained from studies performed on CRC tissues were considered.
Estimation of microbial sRNA expression levels.
The sRNA-Seq reads that were not aligned were mapped against microbial genomes using the Kraken algorithm (65). Kraken was applied in the default setting by considering bacterial, archaeal, and viral genomes from NCBI. Using these settings, only reads with length ≥31 bp were used for the analysis (average, 30.71% of human unmapped reads). The fraction of reads analyzed by Kraken and the average read length in each sample are reported in Table S2. The Kraken output in MetaPhlAn2 format obtained with the option –mpa-format was used in the analysis. For each taxonomic level, the number of reads assigned to microbial annotations by Kraken was converted into relative abundances using the total number of assigned reads.
Quantification of bsRNAs annotated in RNA Central v12 (66) was performed by considering only bacterial species characterized by an average abundance higher than 0.01% in our metagenomic data. Considering these bacterial species, 286,433 bsRNAs with a length of <80 bp were retrieved from the database and were merged into a set of 21,088 nonredundant sequences (since identical sequences were annotated to different species or strains) (see Table S6A). The expression level of BSRD annotations (25) was quantified using BWA by aligning human-unmapped reads against their sequence.
Differential expression analysis of RNA Central and BSRD annotations was performed using the DESeq2 R package (63).
Comparison and integration of MetaPhlAn2 and Kraken results.
The comparison between MetaPhlAn2 and Kraken relative abundances was performed by considering separately the abundances computed at each taxonomic level. Correlation analysis was performed using the Pearson method and the R function “cor” and correlation P value computed using the “cor.test” R function. Multiple-test correction was performed using the BH method implemented in the “p.adjust” R function. Only correlations associated with a BH-adjusted P value lower than 0.05 were considered significant. The most highly correlated species were identified by considering only the species with an average abundance greater than 0.1% in at least one sample group, a Pearson r higher than 0.6, and a BH-adjusted P value lower than 0.05.
The estimation of bsRNA transcriptional rate was computed by dividing the bacterial relative abundance computed by Kraken by the abundances computed by MetaPhlAn2.
Analysis of bacterial species annotated to CRC disease was performed using the annotation from the Disbiome database (53).
Secondary structure prediction.
Prediction of the secondary structure of reads annotated to bsRNAs and the associated MFE level were computed using the RNAFold algorithm (67). The algorithm was applied to reads assigned to hsa-miRNAs, non-miRNA sRNA annotations, the human genome, or bsRNAs or not aligned. For each data set, the average MFE was computed. The statistical significance of MFE differences among read types was computed using the Wilcoxon rank sum test.
Analysis of hsa-miRNAs correlated with
Correlations between
The list of Gene Ontology Biological Processes enriched for the list of genes targeted by
Machine learning approach for sample classification.
A Random Forest classifier was applied to classify each class of subjects, as described previously (5). Specifically, four types of the obtained quantitative profiles, namely, bDNA, bsRNA, hsa-miRNA, and hsa-sncRNA, were considered, together with any combination of the four, for a total of 14 different data categories. For each type of data, three different classification comparisons were considered: CRC versus healthy, CRC versus adenoma, and adenoma versus healthy. The CRC class, when present, has been considered the positive class; the adenoma class has been considered the positive class in the lattermost combination. The total number of machine learning experiments was 42. Each experiment consisted of a 10-fold cross-validation iterated 20 times. Folds contained comparable numbers of positive-class and negative-class representatives. The AUC was computed as the average among 200 tests for each comparison. Feature rankings from the Random Forest shown in this work have been rigorously obtained by considering each time the sole training set to avoid overestimations due to overfitting issues.
Evaluation of the identified stool hsa-miRNA signature in CRC tissues from public datasets.
The expression of hsa-miRNAs belonging to the feature set providing the best classification of CRC and healthy subjects was evaluated in independent analyses of hsa-miRNA expression in primary CRC tissues and adjacent colonic mucosa. In detail, the expression of 19 hsa-miRNAs belonging to the Random Forest signature was compared with that seen with three data sets of differentially expressed hsa-miRNAs (“Mjelle et al., 2019” [69], “Neerincx et al., 2015” [70], and “Sun et al., 2016” [71] sets) from Table S2 in reference 15, the set of 76 differentially expressed hsa-miRNAs from Table S1 in reference 12, and the set of hsa-miRNAs differentially expressed between CRC and adjacent tissues as obtained from the reanalysis of three microarray experiments (GEO accession no. GSE108153, GSE115513, and GSE35834). Differential expression analysis of microarray data was performed using the GEO2R tool, and microarray probes were converted to miRBase v21 annotations using miRiadne (68). For each analysis, only significantly differentially expressed hsa-miRNAs were considered (adjusted P value < 0.05) and the overlap of the numbers of these hsa-miRNAs and those from the Random Forest signature was statistically evaluated using Fisher’s exact test.
Data availability.
Raw sRNA-Seq data are deposited in Gene Expression Omnibus (GEO) with the identifier GSE132236. Raw metagenomic data are deposited in the Sequence Read Archive (SRA) database with the identifier SRP136711.
b Department of Computer Science, University of Turin, Turin, Italy
c Department of Surgical and Medical Sciences, University of Catanzaro, Catanzaro, Italy
d Department of Colorectal Surgery, Clinica S. Rita, Vercelli, Italy
e Department CIBIO, University of Trento, Trento, Italy
f Imperial College, London, United Kingdom
g Department of Medical Sciences, University of Turin, Turin, Italy
h Department of Molecular Biology of Cancer, Institute of Experimental Medicine, Prague, Czech Republic
Luxembourg Centre for Systems Biomedicine
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Copyright © 2019 Tarallo et al. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
ABSTRACT
Dysbiotic configurations of the human gut microbiota have been linked to colorectal cancer (CRC). Human small noncoding RNAs are also implicated in CRC, and recent findings suggest that their release in the gut lumen contributes to shape the gut microbiota. Bacterial small RNAs (bsRNAs) may also play a role in carcinogenesis, but their role has been less extensively explored. Here, we performed small RNA and shotgun sequencing on 80 stool specimens from patients with CRC or with adenomas and from healthy subjects collected in a cross-sectional study to evaluate their combined use as a predictive tool for disease detection. We observed considerable overlap and a correlation between metagenomic and bsRNA quantitative taxonomic profiles obtained from the two approaches. We identified a combined predictive signature composed of 32 features from human and microbial small RNAs and DNA-based microbiome able to accurately classify CRC samples separately from healthy and adenoma samples (area under the curve [AUC] = 0.87). In the present study, we report evidence that host-microbiome dysbiosis in CRC can also be observed by examination of altered small RNA stool profiles. Integrated analyses of the microbiome and small RNAs in the human stool may provide insights for designing more-accurate tools for diagnostic purposes.
IMPORTANCE The characteristics of microbial small RNA transcription are largely unknown, while it is of primary importance for a better identification of molecules with functional activities in the gut niche under both healthy and disease conditions. By performing combined analyses of metagenomic and small RNA sequencing (sRNA-Seq) data, we characterized both the human and microbial small RNA contents of stool samples from healthy individuals and from patients with colorectal carcinoma or adenoma. With the integrative analyses of metagenomic and sRNA-Seq data, we identified a human and microbial small RNA signature which can be used to improve diagnosis of the disease. Our analysis of human and gut microbiome small RNA expression is relevant to generation of the first hypotheses about the potential molecular interactions occurring in the gut of CRC patients, and it can be the basis for further mechanistic studies and clinical tests.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer