Natural fungal communities are characterized by a high diversity of unique biochemical pathways, including the biosynthesis of antimicrobial substances, organic acids, and even toxins (Branco, 2019). At the same time, individual fungi from one community might significantly differ from each other in terms of their functional capabilities (Wisecaver et al., 2014). This is due to differences in gene content.
Functional capabilities are of great interest to modern industrial biotechnology (Chergui et al., 2021; Habibi et al., 2021; Li et al., 2020). Also, information about gene content can be useful to solve ecological problems such as the bioconversion of solid waste (Chilakamarry et al., 2022) or caffeine utilization (Zhou et al., 2018). To get a more complete picture of the functional capabilities of the fungi under study, functional annotation of whole genome assemblies is used.
Unfortunately, it is not always possible to conduct whole genome sequencing (WGS) of explored fungi. This is partly due to the high cost of WGS and some of the technical hurdles that might appear while working with eukaryotic cells. At the same time, the majority of studies of fungi are accompanied by sequencing to determine fungal taxonomy. One of the most used taxonomy markers for fungi is the ITS region (Blaalid et al., 2013; Mbareche et al., 2020; Schoch et al., 2012) The region consists of two variable subregions (ITS1 and ITS2) separated by a conserved 5.8S gene. Although the ITS sequence can provide fungal taxonomy, it does not allow for functional annotation and evaluation of gene content. In this case, a tool to predict the gene content profile for a given ITS sequence would be useful.
There are a number of tools designed to predict the biochemical features of bacterial communities using 16S sequencing data (Douglas et al., 2020; Sun et al., 2020). At the same time, functional annotation of the fungal part of the community is usually performed using WGS, since the number of precisely annotated fungi is not enough to build complex predictive models such as the algorithm implemented in PICRUST2 (Douglas et al., 2020). For this reason, the fungal part of microbial communities often remains to be neglected.
Based on a similar idea of PICRUST2, we decided to develop an algorithm that is designed to solve a task for functional annotation of fungal ITS sequences datasets. The tool is based on the assumption that the functional capabilities of fungi are in strong correlation with their genetic-related organisms (Ke & Tsai, 2022; McLaughlin et al., 2015; Wisecaver et al., 2014). We implemented a modified K-nearest neighbors (KNN) model and tested it using shuffle-split cross-validation on the ITS subset. The model has been implemented as a command-line tool called FunFun.
METHODS Data collectionFirst of all, we collected 9271 whole fungal genomes of different assembly levels from the GeneBank database. Next, we extracted ITS1-5.8S-ITS2 sequence fragments from the genomes via in silico PCR method with ITS1F: CTTGGTCATTTAGAGGAAGTAA and ITS4: TCCTCCGCTTATTGATATGC primers using the ipcress tool from the exonerate software package (Slater & Birney, 2005). After this stage, 6132 fungal genomes were left, which might be the result of the unsuccessful annealing of primers. Generally, in silico PCR does not guarantee the absolute correctness of extracted sequences. To ensure that the extracted fragments truly were ITS, we verified them using ITSx software, which specializes in extracting the ITS region (Bengtsson-Palme et al., 2013). Based on the ITSx annotations, we built databases of individual ITS1 and ITS2 sequences, and full ITS1-5.8S-ITS2 fragments (they will be called Concatenates in the further text) belonging to 5882 fungal genomes. The collected database included representatives of 254 different families (Data S3).
Data processingThe genome assemblies were annotated using the Augustus tool (Keller et al., 2011). Predicted protein sequences were analyzed by the KofamKOALA tool (Aramaki et al., 2020) to obtain functional annotations. According to the results, we built gene content vectors constructed using the third level of the KoFAM hierarchy (430 vector components, presented in Data S2). Each component in this vector represents the relative abundance of the corresponding KEGG orthology group. To avoid the appearance of non-fungal biochemical pathways, the KEGG orthology data were manually filtered by removing the 09160 Human Disease branch.
However, at this stage, we did not have absolute confidence in the correctness of the functional profiling of the protein sequences obtained using Augustus. Therefore, we decided to apply a similar approach to protein sequences based on verified data. For this, a dataset was formed from 75 fungal assemblies from the RefSeq database, for which protein sequences were available. Here, to ensure the reliability of data, we considered only the protein sequence data obtained using the Eukaryotic Annotation Propagation Pipeline (“The NCBI Eukaryotic Genome Annotation Pipeline,”, n.d.), which uses transcriptomic data to perform the functional annotation. Thereafter, for each assembly, we constructed paired functional gene content profiles, where the first profile was based on curated protein sequences data from RefSeq and the second profile was built using Augustus gene annotations. Convergence of the results was estimated using Pearson's correlation coefficient for each assembly pair independently.
To preliminary validate our method of functional content profiling, we used the t-SNE decomposition for gene content vectors. We expected that observed clusters would be associated with fungal taxonomy, which is according to literature data (Ke & Tsai, 2022; McLaughlin et al., 2015; Wisecaver et al., 2014). Here, HDBSCAN was used for cluster identification. To estimate the convergence of observed clusters and fungal taxonomy, the Rand Index (RI) was used.
To demonstrate the correlation between fungal gene content and their lifestyle, we used a database compiled by the authors of the FunGUILD tool (Nguyen et al., 2016). To evaluate the convergence of the generated gene content clusters with lifestyle, we also applied HDBSCAN to isolate clusters, after which the clusters and lifestyles labels were compared using the Rand index.
Algorithm explanationFor each ITS sequence in the collected database, a relative abundance of each nucleotide k-mer is calculated. k was chosen to be 5 because the k-mers length equals 5 provides a satisfactory resolution of the fungal taxonomy, which can be seen on the dendrogram obtained by the method of hierarchical agglomerative clustering (Figure S4 in Data S1). For each sequence, we generate a k-mer relative abundances vector with a length of 45. In the further text, we will name such k-mer abundances vectors as k-mers vectors. Thereafter, we evaluate the cosine distance between the k-mers vector representing the target sequence and the k-mers vectors for each fungal reference amplicon variant (RAV) from the database. Thus, we construct a list of distances, which describe the similarity between the target sequence and each RAV from the database.
To predict the fungal gene content, the algorithm searches for KNN in the ε neighborhood using the generated distances list. Here, the ε neighborhood is the area, which limits the space where the neighbors searching will be performed. The number of neighbors not exceeding K and located in the area, limited by ε neighborhood, we call Kchosen. Such an approach allows ignoring too far sequences while gene content prediction. For each i-th fungi in our database we have a precalculated gene content profile Fi = (f1i, f2i, f3i, …, fLi), Fi is a vector with length L (L = 430). In Fi, each fji value is a fraction of the individual j-th function while the sum of these values is equal to 1. To calculate the functional gene content profile for the target sample we average the value of each individual function among the Kchosen neighbors.[Image Omitted. See PDF]where
Ftarget—predicted gene content profile of target fungi, Kchosen—number of neighbors not exceeding K and limited by ε neighborhood, fji—fraction of j-th function in the genome of i-th neighbored fungi from Kchosen.
In cases, when the target sequence has neighbors with cosine distance to the target of 0 (that usually means the identity of ITS sequences), only these neighbors are selected for the prediction. As a result, five different cases are possible, which are visualized in Figure 1.
FIGURE 1. Possible cases of neighbor selection (in this example K = 8). (a) When the number of reference amplicon variants (RAVs) included in the ε neighborhood is less than the specified number K; (b) when the ε neighborhood includes exactly K neighbors; (c) when the number of points included in the ε neighborhood is greater than K; (d) when the target sequence has at least one RAV with identical k-mers vector; and (e) when there are no neighbors in the ε neighborhood.
To test our algorithm, we split our database on test/train datasets and tested using shuffle-split cross-validation. The train/test ratio was chosen to be 80/20, with 10 shuffling iterations. So, for each shuffling epoch 1176 testing samples were used. For each epoch, the quality of prediction was estimated using a median R2 metric between real gene content vectors, which were built using whole genome annotations, and predicted vectors which were obtained with FunFun. Validation was performed for each amplicon type (ITS1, ITS2, and Concatenate) independently.
RESULTS Software descriptionWe have developed an algorithm aimed at predicting fungal gene content using ITS sequencing data. As an input, the algorithm can use ITS1, ITS2, or full-size ITS clusters (concatenates) in fasta format. A fasta file can contain either one sequence or several. As an output, the algorithm returns the table, where the first column is the KEGG orthology group (which we call function), and the subsequent columns correspond to predicted gene content vector components for each sample. The algorithm has been realized as a command-line tool which we named FunFun (Fungal Functional). Originally, FunFun was developed to be capable of predicting gene content for individual fungi but this algorithm can be useful in the analysis of metagenomic analysis.
The tool has two hyperparameters K—the maximum number of nearest neighbors selected and ε that defines the area where the K neighbors search is performed. In the case when the tool does not return a prediction for the analyzed sequence, the user can increase the ε value. However, it should be noted here that the greater the ε value, the lower the expected prediction quality (Figure 3a). On the other hand, as we observed the number of neighbors K does not drastically influence the prediction quality in general but in some cases might make the prediction more robust, especially while the target sequence has a lot of close neighbors. By default, K is set to be 10 and ε is set to be 0.5.
Validation resultsThe algorithm developed is based on the assumption that fungal metabolism correlates with their taxonomy, so we expected that gene content profiles, which we calculated with our approach, also should correlate with fungal taxonomy. To verify this assumption, t-SNE decomposition of calculated gene content vectors and clustering with the HDBSCAN algorithm (Campello et al., 2013, 2015) were performed (Figure 2). Figure 2 shows a two-dimensional mapping of the entire space of individual functions of the gene composition; each color in this picture corresponds to the individual fungal family. Predicted clusters and NCBI taxonomy labels (family level) were compared (RI = 0.929). The largest clusters were additionally labeled in Figure 2. We have also performed a t-SNE decomposition of 5-mers vectors of full-size ITS region (Figure S7 in Data S1). Additionally, we checked the relationship between clear clusters obtained on gene content vectors and corresponding full-size ITSs (regardless of GenBank taxonomy; Figure S2 in Data S1). This test resulted in a Rand index value of 0.963 which indicates a strong correlation.
FIGURE 2. t-SNE decomposition of fungal gene content profiles, where individual dots correspond to one sample. Each color corresponds to an individual fungal taxon (family level).
Fungal taxonomy is known to be not unambiguous (Hawksworth, 2011) which might lead to visible clustering imperfection and the presence of formal outliers for family clusters on the t-SNE plot that can be seen in Figure 2. In addition, the t-SNE algorithm itself tends to generate artifact clusters of dots which actually consist of different taxa (Figure S3 in Data S1). Moreover, our database is not balanced by families and some organisms may be the only members of their family (Massarinaceae, Aulographaceae, etc.), which can also affect the quality of clustering.
We also checked the validity of Augustus annotations by comparing functional profiling of Augustus annotations with corresponding RefSeq protein data. The analysis showed a good value for the Pearson correlation score (r = .62; Figure S1 in Data S1). Thus, the results of functional profiling obtained with Augustus are quite close to the results confirmed by experimental transcriptome data, so we decided that Augustus is suitable as a genome annotator in our method.
Using the number of neighbors of K = 10, we tested our algorithm with a range of different ε values (Figure 3). At the validation stage, we calculated the R2 value between the gene content vector, which we built on full genome data, and the gene content vector predicted on ITS sequence using FunFun. Also, we estimated the percentage of predicted samples from the test subset (gene content is predicted only if the target ITS has at least one neighbor in the ε neighborhood).
FIGURE 3. Validation results. (a) Dependence of gene content prediction quality on the ε value. (b) Dependence of the percentage of predictions on ε value (in some cases the ε neighborhood can include no one neighbors which does not allow to build predicted profiles with sufficient quality. Thus, the lower the ε value, the less the number of predicted profiles).
We found that as input data, it is more efficient to pass to the algorithm the full-size ITS sequences (concatenates). On average, the values of median R2 were at least 0.95 regardless of the ε value chosen. So, we set ε = 0.5 and K = 10 as default values. If necessary, the user is able to set custom values for them when running FunFun. It should be noted that different families are presented in different ratios which definitely affects both the quality of clustering and the quality of prediction. To detect the “problem fungal families” we carried out the full FunFun pipeline on all ITS sequences in the base, and divided them by family to compare the prediction accuracy (Figure S5 in Data S1).
Quite high R2 values might be partially explained by conservative gene functions, which are common features of all fungi. Examples of these functions are glycolysis and the citric cycle. So, we decided to additionally check how the method estimates relative abundances of the functions most variable among different fungal genomes. We chose such functions using the coefficient of variation, which is defined as the standard deviation normalized to the mean. These functions included glucosamine degradation, biosynthesis of ansamycine, toluene degradation, indole alkaloid biosynthesis, caffeine biosynthesis, and so on. We considered a function as highly variable if its coefficient of variation was greater than 10. As in the previous validation stage, we calculated the median R2 value between real and predicted gene content vectors. In the results, we obtained median R2 = 0.99 and the percentage of predicted equals 98.6%.
Some biochemical features are associated with fungal guilds (Liu et al., 2023; Nguyen et al., 2016; Su et al., 2023). We decided to check the convergence of guilds and gene content profiles. Guild labels were downloaded from the database used in the FUNGuild tool (Nguyen et al., 2016). Visible correlation between the guild and inferred gene content has been observed (RI = 0.84; Figure S6 in Data S1). Thus, our algorithm allows us to catch functions, characterized by fungal lifestyles.
Metagenome case modelingIn our view, the metagenomic analysis should be carried out after obtaining consensus assemblies of ITS sequences, after which the resulting multi-fasta can be processed by FunFun. To simulate this case, we took random sequences from our database for eight different divisions and predicted gene content profiles for them using our approach and compared the convergence of the predicted and real profiles using R2 (Figure 4). For prediction, we removed the predicted sequence from our database. Note that our algorithm does not use relative abundances of the fungi in the community, but returns just profiles for individual ITS consensus sequences since it is rather difficult to precisely estimate fungal composition in the sample due to the lack of reliable and reproducible methods to analyze mycobiome data (Martinsen et al., 2021; Weaver et al., 2019).
FIGURE 4. Comparison of gene content profiles based on whole genome sequence data and predicted with FunFun for each ITS from the simulated metagenome. For each pair, the lower bar represents the actual fractions of corresponding functions in the genome, while the upper bar shows function fractions predicted with FunFun.
Fungi being an essential part of microbial communities often remain neglected during metagenomic studies. In this work, we tried to develop an algorithm that would be capable of predicting fungal gene content based not on whole genome sequencing but using amplicon sequencing data, which is significantly less expensive. At the same time, there are no such tools for predicting the functional capability by the marker sequence for fungi.
Thus, we have developed FunFun, a novel tool designed for the estimation of fungal functional capability based on ITS amplicon sequencing data. It can be useful for the estimation of fungal gene content in individual fungi as well as in mycobiome. We should note that our approach has certain limitations while analyzing fungal ITS sequences which do not have close RAVs presented in the collected database. However, the authors hope that, in the future, the gradual addition of new genomes to the FunFun database will partially solve this issue. In this initial release, the database was collected from GenBank only and did not include the manually curated genomes from Mycocosm (Grigoriev et al., 2014) which we are planning to reannotate and add to FunFun in further releases.
We have shown that our approach allows us to predict not only the common functions such as a citric cycle or glycolysis but also the most varied. It might be helpful for researchers, who want to estimate the gene content of the investigated fungi.
AUTHOR CONTRIBUTIONSDanil V. Krivonos: Conceptualization (equal); data curation (equal); formal analysis (equal); software (lead); visualization (lead); writing – original draft (equal). Dmitry N. Konanov: Conceptualization (equal); formal analysis (equal); methodology (equal); writing – original draft (equal). Elena N. Ilina: Writing – review and editing (equal).
ACKNOWLEDGMENTSThis research was supported by the Russian Science Foundation grant 21-15-00431.
CONFLICT OF INTEREST STATEMENTThe authors declare no conflict of interest.
DATA AVAILABILITY STATEMENTFunFun can be installed via pip and can be downloaded from
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2023. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
The study of individual fungi and their communities is of great interest to modern biology because they might be both producers of useful compounds, such as antibiotics and organic acids, and pathogens of various diseases. And certain features associated with the functional capabilities of fungi are determined by differences in gene content. Information about gene content is most often taken from the results of functional annotation of the whole genome. However, in practice, whole genome sequencing of fungi is rarely performed. At the same time, usually sequence amplicons of the ITS region to identify fungal taxonomy. But in the case of amplicon sequencing there is no way to perform a functional annotation. Here, we present FunFun, the instrument that allows to evaluate the gene content of an individual fungus or mycobiome from ITS sequencing data. FunFun algorithm based on a modified
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer