INTRODUCTION
Microbes can form complex multispecies communities that perform critical functions in maintaining the integrity of their environment [,] or the well-being of their hosts []. For example, microbial communities play key roles in nutrient cycling in soils [] and crop growth []. In humans, the gut microbiota plays important roles in our nutrition [], immune system response [], pathogen resistance [], and even our central nervous system response []. Still, species invasions (e.g., pathogens) and extinctions (e.g., due to antibiotic administration) produce changes in the species assemblages that may shift these communities to undesired compositions []. For instance, antibiotic administrations can shift the human gut microbiota to compositions making the host more susceptible to recurrent infections by pathogens []. Similarly, intentional changes in the species assemblages, such as by using fecal microbiota transplantations, can shift these communities back to desired “healthier” compositions [,]. Therefore, improving our ability to rationally manage these microbial communities requires that we can predict changes in the community composition based on changes in species assemblages []. Building these predictions would also reduce management costs, helping us to predict which changes in the species assemblages are more likely to yield a desired community composition. Unfortunately, making such a prediction remains challenging because of our limited knowledge of the diverse physical [], biochemical [], and ecological [,] processes governing the microbial dynamics.
To overcome the above challenge, we present a deep learning framework that automatically learns the map between species assemblages and community compositions from training data only, without knowing the underlying microbial dynamics. We systematically validated our framework using synthetic data generated by classical ecological dynamics models, demonstrating its robustness to changes in the system dynamics and measurement errors. Then, we applied our framework to real data of both in vitro and in vivo communities, including ocean and soil microbial communities [,], Drosophila melanogaster gut microbiota [], and human gut [] and oral microbiota []. Across these diverse microbial communities, we find that our framework learns to predict accurate out-of-sample compositions given a few training samples. Our results show how deep learning can be an enabling ingredient for understanding and managing complex microbial communities.
PREDICTING MICROBIOME COMPOSITIONS FROM SPECIES ASSEMBLAGES
Consider the pool $\Omega = \{1, \dots, N\}$ of all microbial species (or taxa) that can inhabit an ecological habitat of interest, such as the human gut. A microbiome sample obtained from this habitat can be considered as a local community assembled from $\Omega$ with a particular species assemblage. The species assemblage of a sample is characterized by a binary vector $z \in \{0,1\}^N$, where its $i$th entry satisfies $z_i = 1$ (or $z_i = 0$) if the $i$th species is present (or absent) in this sample. Each sample is also associated with a composition vector $p \in \Delta^N$, where its $i$th entry $p_i$ is the relative abundance of the $i$th species, and $\Delta^N = \{p \in \mathbb{R}^N_{\geq 0} : \sum_{i} p_i = 1\}$ is the probability simplex. Therefore, our problem can be formalized as learning the map
$$\varphi : z \in \{0,1\}^N \longmapsto p \in \Delta^N.$$
Next, we show it is possible to predict the microbial composition from species assemblage without knowing the mechanistic details of all the above processes. Our approach consists of learning the map directly from a data set of microbiome samples. We arrange each of those samples as a pair $(z, p)$ of species assemblage $z$ and composition $p$ satisfying the map of Equation (), see Figure . Note that microbiome samples are readily available using standard metagenomic sequencing techniques.
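The arrangement of a sample as an assemblage/composition pair can be sketched as follows. This is a minimal illustration, assuming a sample arrives as a list of non-negative abundances (for example, read counts) over a fixed ordering of taxa; the function name `to_pair` and the example numbers are hypothetical and not taken from the study.

```python
from typing import List, Tuple


def to_pair(abundances: List[float]) -> Tuple[List[int], List[float]]:
    """Turn raw abundances of N taxa into (assemblage z, composition p).

    z[i] is 1 if taxon i is present and 0 otherwise; p[i] is the relative
    abundance of taxon i, so p lies on the probability simplex.
    """
    if any(a < 0 for a in abundances):
        raise ValueError("abundances must be non-negative")
    total = sum(abundances)
    if total <= 0:
        raise ValueError("a sample must contain at least one taxon")
    z = [1 if a > 0 else 0 for a in abundances]
    p = [a / total for a in abundances]
    return z, p


# Hypothetical sample over four taxa; the second taxon is absent.
z, p = to_pair([30, 0, 10, 60])
print(z, p)
```

A data set is then simply a list of such pairs, one per sample.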
[IMAGE OMITTED. SEE PDF]
Conditions for predicting compositions from species assemblages
To ensure that the problem of learning this map from data is mathematically well-posed, we make the following assumptions. First, we assume that the species pool in the habitat has universal dynamics [] (i.e., different local communities of this habitat can be described by the same population dynamics model with the same parameters). This assumption is necessary because, otherwise, the map does not exist, implying that predicting community compositions from species assemblages has to be done in a sample-specific manner, which is a daunting task. The universal dynamics assumption will be satisfied when samples in the data set were collected from similar environments. Indeed, in this case, the environmental factors can be treated as roughly fixed and hence need not be used for composition prediction. For in vitro communities, the universal dynamics assumption is satisfied if samples were collected from the same experiment or from multiple experiments with very similar environmental conditions. For in vivo communities, empirical evidence indicates that the human gut and oral microbiota of healthy adults, as well as certain environmental microbiota, display strong universal dynamics [].
Second, we assume that the compositions of the collected samples represent steady states of the microbial communities. This assumption is natural because the map is not well defined for highly fluctuating microbial compositions. We note that observational studies of host-associated microbial communities such as the human gut microbiota indicate that they remain close to stable steady states in the absence of drastic dietary change or antibiotic administrations [,,].
Finally, we assume that for each species assemblage there is a unique steady-state composition. In particular, this assumption requires that true multistability does not exist for the species pool (or any subset of it) in this habitat. This assumption is required because, otherwise, the map is not single-valued (one assemblage would correspond to several compositions), and the prediction of community compositions becomes mathematically ill-defined.
In practice, we expect that the above three assumptions cannot be strictly satisfied. Therefore, any algorithm that predicts microbial compositions from species assemblages needs to be systematically tested to ensure its robustness against errors due to violations of these assumptions. Note that we can a priori check if a microbiome data set satisfies the universal dynamics assumption using the Dissimilarity-Overlap analysis []. Yet, it is mathematically challenging to a priori check if the other two assumptions are satisfied for real data. Nevertheless, the ability to accurately predict microbiome compositions from species assemblages is a posteriori evidence of the validity of the above three assumptions.
Learning to predict species compositions
Consider building a map $\hat{\varphi}_\theta$, parametrized by $\theta$, giving the predicted composition $\hat{p} = \hat{\varphi}_\theta(z)$ associated with the species assemblage $z$. Under the above assumptions, we can in principle learn the map of Equation () from the data set by training $\hat{\varphi}_\theta$ (i.e., adjusting its parameters $\theta$ to ensure that $\hat{p}$ approximates the observed composition $p$). Existing deep learning network architectures and training methods [,], such as ResNet [] trained with a gradient descent algorithm, are natural candidates to solve this problem (Methods Section). We found that it is possible to train a ResNet architecture for predicting microbiome compositions in simple cases like small in vitro communities (Supporting Information Note ). But for large in vivo communities like the human gut microbiota, ResNet does not perform very well (Figure ). The poor performance of ResNet is likely due to a vanishing gradient problem during training []. Namely, the ResNet architecture must satisfy two restrictions that are very particular to the map of Equation (). First, the predicted compositions must be compositional (i.e., $\hat{p}$ must be non-negative with entries summing to one). Second, the predicted relative abundance of any absent species in the assemblage must be identically zero (i.e., $z_i = 0$ should imply that $\hat{p}_i = 0$).
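The two restrictions can be stated as a simple check on any candidate prediction. The sketch below is illustrative only: the function name `satisfies_restrictions` and the numerical tolerance are assumptions, with `z` denoting the binary species assemblage and `p_hat` the predicted composition.

```python
from typing import Sequence


def satisfies_restrictions(z: Sequence[int], p_hat: Sequence[float],
                           tol: float = 1e-9) -> bool:
    """Check the two restrictions on a predicted composition.

    1. p_hat is compositional: non-negative entries that sum to one.
    2. Absent species (z[i] == 0) have zero predicted abundance.
    """
    non_negative = all(x >= -tol for x in p_hat)
    sums_to_one = abs(sum(p_hat) - 1.0) <= tol
    zero_if_absent = all(abs(x) <= tol
                         for zi, x in zip(z, p_hat) if zi == 0)
    return non_negative and sums_to_one and zero_if_absent


# Species 2 is absent from the assemblage in both examples.
print(satisfies_restrictions([1, 0, 1], [0.4, 0.0, 0.6]))  # valid prediction
print(satisfies_restrictions([1, 0, 1], [0.4, 0.1, 0.5]))  # absent species > 0
```

A generic network output typically fails the second restriction unless the architecture enforces it.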
To overcome the limitations of traditional deep learning frameworks based on neural networks (such as ResNet) in predicting microbial compositions from species assemblages, we developed cNODE (compositional Neural Ordinary Differential Equation, see Methods Section and Figure ). We designed the cNODE framework using the notion of Neural Ordinary Differential Equations, which can be interpreted as a continuous limit of the ResNet architecture []. Crucially, the architecture and initialization of cNODE ensure that the above two restrictions are satisfied by construction. Furthermore, cNODE's architecture naturally circumvents the typical difficulties of handling zero values associated with compositional data analysis. Zero abundance values often occur in human microbiome datasets because of their highly personalized compositions across hosts (i.e., different individuals tend to have different species assemblages). To evaluate the prediction error of cNODE, one can choose any dissimilarity measure between the predicted and actual compositions related to a given species assemblage. Once this dissimilarity measure is selected, we train cNODE using a meta-learning algorithm for a given number of epochs to minimize the average prediction error in a training data set (Methods Section). Using this meta-learning algorithm improves the ability of cNODE to predict the composition of never-seen-before species assemblages. Once trained, we evaluate the performance of cNODE by calculating its average prediction error in a test data set containing samples not used during the training.
Figure illustrates the application of cNODE in a small experimental community of bacterial species of Drosophila melanogaster microbiota studied by Gould et al. []. The data set obtained from this study has 26 samples (Methods Section). To illustrate the potential of cNODE, we consider a training data set of 21 randomly chosen samples (Figure ). As explained before, we arrange each training sample as a pair of “species assemblage” (top) and “composition” (bottom). Once trained, the main use of cNODE is to predict the composition of “never-seen-before” species assemblages, namely, “test assemblages” that are not in the training data set. To evaluate the performance of cNODE for predicting such test assemblages, we use as test data set the remaining five experimental samples not included during training. Figure shows that the trained cNODE predicts accurate compositions for the test species assemblages. For example, cNODE predicts that in the assemblage of species 3 with species 4 (which was not used for training), species 3 will become nearly extinct. This prediction agrees well with the actual experimental result (sample 26 in Figure ).
RESULTS
In silico validation of cNODE with large species pools
We first evaluated cNODE's performance using in silico microbiome samples generated as steady-state compositions of pools with species and Generalized Lotka-Volterra (GLV) population dynamics (Methods Section). We characterize the population dynamics of a species pool using two parameters. First, the connectivity , characterizing how likely it is that two species in the pool interact directly. Second, the typical interaction strength , characterizing the typical effect of one species on the per-capita growth rate of another species if they interact. Different habitats where the species pool is assembled are thus represented by different parameters . Note that, despite its simplicity, the GLV model successfully describes the population dynamics of microbial communities in diverse environments, from the soil [] and lakes [] to the human gut [,,].
Figure shows the performance of cNODE during training. The training and test datasets have samples for this panel. Note that the training prediction error decreases with the number of training epochs, especially for low values of . Interestingly, the test prediction error reaches a plateau after sufficient training epochs, regardless of the value of . This plateau implies that cNODE was adequately trained with low overfitting. Note that the plateau's value increases with (i.e., the test prediction error increases). This result remains valid for different training data set sizes and different values for the parameters . In all these cases, the test prediction error reaches a plateau whose value increases both by increasing (Figure ) or (Figure ). But, crucially, such an increase can be compensated by increasing the number of samples in the training data set. This result implies that, in general, cNODE requires a larger number of training samples in species pools with higher connectivity or higher typical interaction strength between species. Overall, these results suggest that using or more training samples is enough to adequately train cNODE, regardless of the habitat type. In this case, we also observe a high correlation between the true and predicted compositions in the test data set, as expected from a low test prediction error (Figure ).
[IMAGE OMITTED. SEE PDF]
To systematically evaluate the robustness of cNODE against violation of its three key assumptions, we performed three types of validations. In the first validation, we generated datasets that violate the assumption of universal dynamics (Methods Section). In this case, if two species interact, the effect of one species over the per-capita growth rate of the other species changes on average by among samples in the data set. Therefore, the value corresponds to universal dynamics, and larger values of correspond to more significant losses of universal dynamics. We find that cNODE is robust against universality loss as its asymptotic prediction error changes continuously and maintains a reasonably low test prediction error up to (Figure ). cNODE is also robust to losses of universal dynamics that occur when species interact with different species in a sample-specific manner (Figure ).
In the second validation, we evaluated the robustness of cNODE against measurement noise in the relative abundance of species (Methods Section). We characterize the noise intensity by a constant . The measurement noise may cause some absent species to be measured as present and vice versa. We find that cNODE performs adequately up to (Figure ).
In the final validation, we generated datasets with true multistability by simulating a population dynamics model with nonlinear functional responses (Methods Section). For each species assemblage, these functional responses generate two interior equilibria in different “regimes”: one regime with low biomass and the other with high biomass. Therefore, each species assemblage can have two associated compositions. We built training datasets by choosing a fraction of samples from the first regime and the rest from the second regime. We find that cNODE is robust enough to provide reasonable predictions up to (Figure ).
cNODE predicts microbiome compositions in real microbial communities
We evaluated cNODE using six microbiome datasets of different habitats (Supporting Information Note ). The first data set consists of samples [] of the ocean microbiome at the phylum taxonomic level, resulting in different taxa. The second data set consists of in vivo samples of Drosophila melanogaster gut microbiota of species [], as described in Figure . The third data set has samples of in vitro communities of soil bacterial species []. The fourth data set contains samples of the Central Park soil microbiome [] at the phylum level ( phyla). The fifth data set contains samples of the human oral microbiome [] at the genus level ( genera). The final data set has samples of the human gut microbiome from the Human Microbiome Project [] at the genus level ( genera). Note that for each data set, to ensure cNODE has enough training samples, we chose to work at a specific taxonomic level so that the number of samples , where is the total number of taxa at the specific taxonomic level. Note that, based on the Dissimilarity-Overlap analysis, all the six microbiome datasets display the signature of universal microbial dynamics to some extent (Supporting Information Note and Figure ).
To evaluate cNODE, we performed leave-one-out cross-validation on each data set (Methods Section). The median test prediction errors were 0.06, 0.066, 0.079, 0.107, 0.211, and 0.242 for the six datasets, respectively (Figure ). These errors are consistent with the strength of universality observed in each data set. To understand the meaning of these errors, for each data set we inspected five pairs $(p, \hat{p})$ corresponding to the observed and out-of-sample predicted compositions of five samples. We chose the five samples based on their test prediction error. Specifically, we selected the samples with the minimal error, with errors closest to the first quartile, the median, and the third quartile, and with the maximal error (columns in Figure , from left to right). We found that samples with errors below the third quartile provide acceptable predictions (left three columns in Figure ), while samples with errors close to the third quartile or with the maximal error do demonstrate salient differences between the observed and predicted compositions (right two columns in Figure ). Note that in the sample with the largest error of the human gut data set (Figure , rightmost column), the observed composition is dominated by Prevotella (pink) while the predicted composition is dominated by Bacteroides (blue). This drastic difference is likely due to different dietary patterns []. These results also confirm that or more training samples are enough to adequately train cNODE, regardless of the habitat type. Note that using other taxonomic levels in these experimental datasets may change the performance of cNODE because it will effectively change the sample size.
[IMAGE OMITTED. SEE PDF]
DISCUSSION
cNODE is a deep learning framework to predict microbial compositions from species assemblages only. We validated its performance using in silico, in vitro, and in vivo microbial communities, finding that cNODE learns to perform accurate out-of-sample predictions using a few training samples. Classic methods for predicting species abundances in microbial communities use inference based on population dynamics models [,,,]. However, these methods typically require high-quality time-series data of species absolute abundances, which can be difficult and expensive to obtain for in vivo microbial communities. cNODE circumvents the need for absolute abundances or time-series data. However, compared to the classic methods, the price to pay is that cNODE cannot be mechanistically interpreted because of the lack of identifiability inherent to compositional data [,]. We also found that cNODE can outperform existing deep learning architectures like ResNet, especially when predicting the composition of large in vivo microbiomes. Recently, Maynard et al. [] proposed a statistical method to predict the steady-state abundance in ecological communities []. This method requires absolute abundance data of species, which are not available in most microbiome datasets. cNODE can outperform this statistical method despite using only relative abundances (Supporting Information Note ). See also Supporting Information Note and Figure for a discussion of how our framework compares to other related works.
Deep learning techniques are actively applied in microbiome research [], such as for classifying samples that shifted to a diseased state [], predicting infection complications in immunocompromised patients [], or predicting the temporal or spatial evolution of certain species collections [,]. However, to the best of our knowledge, the potential of deep learning for predicting the effect of changing species assemblages has not been explored or validated before. Our framework, based on the notion of neural ODE [], is a baseline that could be improved by incorporating additional information. For example, incorporating available environmental information such as pH, temperature, age, BMI, body site, and host's diet could enhance the prediction accuracy. This additional information would help us predict the species present in different environments. Adding “hidden variables” such as the unmeasured total biomass or unmeasured resources to our ODE will enhance the expressivity of cNODE [,], but this may result in more challenging training. Finally, if available, knowledge of the genetic similarity between species can be incorporated into the loss function by using the phylogenetic Wasserstein distance [] that provides a well-defined gradient [].
We anticipate that a useful application of our framework is to predict whether adding a collection of species to a local community can bring the abundance of a target species below the practical extinction threshold. Thus, given a local community containing the target (and potentially pathogenic) species, we could use a greedy optimization algorithm to identify a minimal collection of species to add such that our architecture predicts that they will decolonize the target species.
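One way such a greedy search might look is sketched below. Everything here is hypothetical: `greedy_decolonize` is an illustrative name, `predict` stands for any trained map from a binary assemblage to a composition, and `toy_predict` is a made-up stand-in used only to make the example runnable. Note also that a greedy search returns a small collection of species but does not guarantee a minimal one.

```python
from typing import Callable, List, Optional, Sequence


def greedy_decolonize(z: Sequence[int], target: int,
                      predict: Callable[[List[int]], List[float]],
                      threshold: float) -> Optional[List[int]]:
    """Greedily add absent species until the predicted relative abundance
    of `target` drops below `threshold`.

    Returns the indices of the added species in the order they were added,
    or None if no further addition lowers the target's predicted abundance.
    """
    current = list(z)
    added: List[int] = []
    while predict(current)[target] >= threshold:
        best, best_value = None, predict(current)[target]
        for i, present in enumerate(current):
            if present == 0:
                trial = current[:]
                trial[i] = 1
                value = predict(trial)[target]
                if value < best_value:
                    best, best_value = i, value
        if best is None:  # no single addition helps any more
            return None
        current[best] = 1
        added.append(best)
    return added


def toy_predict(z: Sequence[int],
                weights: Sequence[float] = (1.0, 4.0, 15.0, 0.5)) -> List[float]:
    """Made-up predictor: abundance proportional to a fixed weight."""
    masked = [zi * wi for zi, wi in zip(z, weights)]
    total = sum(masked)
    return [m / total for m in masked]


# Target species 0 starts alone; find species to add to push it below 10%.
print(greedy_decolonize([1, 0, 0, 0], target=0,
                        predict=toy_predict, threshold=0.1))
```

In practice `predict` would be replaced by the trained model, and the threshold by the practical extinction threshold of the application.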
Our framework has a few limitations. For example, cNODE cannot accurately predict the abundance of taxa that have never been observed in the training data set. An additional limitation of our current architecture is that it assumes that true multistability does not exist, namely, that a community with a given species assemblage permits only one stable steady state in which each species in the collection has a positive abundance. For complex microbial communities such as the human gut microbiota, the highly personalized species collections make it very difficult to decide if true multistability exists or not. We could extend our framework to handle multistability by predicting a probability density function for the abundance of each species. True multistability would correspond to predicting a multimodal density function in such a case. Datasets with insufficient sequencing depth or coverage can produce samples with “fake” multistability, leading to prediction errors that our framework cannot resolve. Indeed, the in silico validation of cNODE in Figure indicates that measurement errors can significantly degrade the performance of cNODE.
In conclusion, the many species and the complex, uncertain dynamics that microbial communities exhibit have been fundamental obstacles to our ability to learn how they respond to alterations, such as removing or adding species. Moving this field forward may require losing some ability to interpret the mechanism behind their response. In this sense, deep learning methods could enable us to rationally manage and predict complex microbial communities' dynamics.
METHODS
A ResNet architecture for predicting microbiome compositions from species assemblages
As a top-rated tool in image processing, ResNet is a cascade of $L$ hidden layers where the state $h_\ell$ of the $\ell$th hidden layer satisfies $h_\ell = h_{\ell-1} + f_\theta(h_{\ell-1})$, for some parametrized function $f_\theta$ with parameters $\theta$. These hidden layers are connected to the input $h_0 = g_{\mathrm{in}}(z)$ and the output $\hat{p} = g_{\mathrm{out}}(z, h_L)$ layers, where $g_{\mathrm{in}}$ and $g_{\mathrm{out}}$ are some functions. Crucially, for our problem, any architecture must satisfy two restrictions: (1) the vector $\hat{p}$ must be compositional (i.e., non-negative with entries summing to one); and (2) the predicted relative abundance of any absent species must be identically zero (i.e., $z_i = 0$ should imply that $\hat{p}_i = 0$). Simultaneously satisfying both restrictions requires that the output layer is a normalization of the form $\hat{p} = z \odot h_L / (z^\top h_L)$, and that $f_\theta$ is a non-negative function (because $h_L \geq 0$ is required to ensure the normalization is correct). As noted above, ResNet with this output layer does not perform well for large in vivo communities. This result is likely due to the normalization of the output layer, which challenges the training of neural networks because of vanishing gradients []. The vanishing gradient problem is often solved by using other normalization layers such as the softmax or sparsemax layers []. However, we cannot use these layers because they do not satisfy the second restriction. We also note that ResNet becomes a universal approximator only in the limit $L \to \infty$, which again complicates the training [].
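To illustrate why a masked normalization is needed at the output while a softmax is not suitable, the following sketch compares the two on a small example. The masked form used here, dividing the entries of a non-negative hidden state by their sum over present species only, is an assumption consistent with the two restrictions; the names `masked_normalize` and `softmax` and the numbers are illustrative.

```python
import math
from typing import List, Sequence


def masked_normalize(z: Sequence[int], h: Sequence[float]) -> List[float]:
    """Output layer p_hat[i] = z[i] * h[i] / sum_j z[j] * h[j].

    Requires a non-negative hidden state h with positive mass on the
    present species, so that the result lies on the simplex.
    """
    if any(x < 0 for x in h):
        raise ValueError("hidden state must be non-negative")
    masked = [zi * hi for zi, hi in zip(z, h)]
    total = sum(masked)
    if total <= 0:
        raise ValueError("no mass on present species")
    return [m / total for m in masked]


def softmax(h: Sequence[float]) -> List[float]:
    """Standard softmax; every output entry is strictly positive."""
    shift = max(h)
    exps = [math.exp(x - shift) for x in h]
    total = sum(exps)
    return [e / total for e in exps]


z = [1, 0, 1]          # the second species is absent
h = [2.0, 5.0, 6.0]    # hypothetical final hidden state
print(masked_normalize(z, h))  # absent species gets exactly zero
print(softmax(h))              # absent species still gets positive abundance
```

Because softmax assigns a strictly positive value to every entry, it cannot return an exact zero for an absent species, whereas the masked normalization does so by construction.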
The cNODE architecture
In cNODE, an input species assemblage is first transformed into the initial condition , where (left in Figure ). This initial condition is used to solve the set of nonlinear ODEs
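A forward pass of this kind can be sketched in a few lines. The sketch is an illustration under stated assumptions rather than the authors' implementation: it assumes the initial condition $h(0) = z / \sum_i z_i$, a replicator-type ODE $dh/d\tau = h \odot (f_\theta(h) - \mathbf{1}\, h^\top f_\theta(h))$ with a linear $f_\theta(h) = \Theta h$, and a forward Euler solver. The names `cnode_forward` and `theta`, the step count, and the matrix values are all hypothetical.

```python
from typing import List, Sequence


def cnode_forward(z: Sequence[int], theta: Sequence[Sequence[float]],
                  steps: int = 100, tau_end: float = 1.0) -> List[float]:
    """Integrate an assumed replicator-type ODE from h(0) = z / sum(z).

    dh_i/dtau = h_i * (f_i(h) - sum_j h_j * f_j(h)),  with f(h) = theta @ h.
    Entries that start at zero stay at zero, and the sum of h is conserved,
    so the output respects both restrictions by construction.
    """
    n = len(z)
    total = sum(z)
    if total == 0:
        raise ValueError("assemblage must contain at least one species")
    h = [zi / total for zi in z]
    dt = tau_end / steps
    for _ in range(steps):
        f = [sum(theta[i][j] * h[j] for j in range(n)) for i in range(n)]
        mean_f = sum(hi * fi for hi, fi in zip(h, f))
        # Forward Euler step; small dt keeps entries non-negative.
        h = [hi + dt * hi * (fi - mean_f) for hi, fi in zip(h, f)]
    return h


# Hypothetical 3-species parameter matrix; species 2 is absent.
theta = [[0.0, 0.5, -1.0],
         [0.3, 0.0, 0.2],
         [1.0, -0.4, 0.0]]
p_hat = cnode_forward([1, 0, 1], theta)
print(p_hat)
```

With this form, training would amount to adjusting the entries of `theta` so that the output matches observed compositions.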
Training cNODE
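We train cNODE by adjusting the parameters $\theta$ so that the predicted composition $\hat{p}$ approximates the observed composition $p$. To do this, we first choose a distance or dissimilarity measure $d(p, q)$ to quantify how dissimilar two compositions $p$ and $q$ are. We choose the Bray-Curtis [] dissimilarity to present our results; however, the performance of cNODE is quite robust to the specific distance or dissimilarity measure used (Figure ). Specifically, for a data set $\mathcal{D}$ of pairs $(z, p)$, we use as loss function the prediction error
$$E(\mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{(z, p) \in \mathcal{D}} d\big(p, \hat{p}(z)\big),$$
where $\hat{p}(z)$ is the composition predicted for the assemblage $z$.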
Once trained, we calculate cNODE's test prediction error that quantifies cNODE's performance in predicting the compositions of never-seen-before species assemblages. Test prediction errors could be due to a poor adjustment of the parameters (i.e., inaccurate prediction of the training set), low ability to generalize (i.e., inaccurate predictions of the test data set), or violations of our three assumptions (universal dynamics, steady-state samples, no true multistability).
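The prediction error described here, the average dissimilarity between observed and predicted compositions over a data set, can be sketched with the Bray-Curtis dissimilarity as follows. The function names and the uniform stand-in predictor are illustrative assumptions; any trained predictor mapping an assemblage to a composition could be passed in its place.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[Sequence[int], Sequence[float]]


def bray_curtis(p: Sequence[float], q: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity: sum|p_i - q_i| / sum(p_i + q_i).

    For two compositions (each summing to one) this equals half the
    L1 distance and ranges from 0 (identical) to 1 (disjoint support).
    """
    numerator = sum(abs(a - b) for a, b in zip(p, q))
    denominator = sum(a + b for a, b in zip(p, q))
    return numerator / denominator


def prediction_error(pairs: Sequence[Pair],
                     predict: Callable[[Sequence[int]], List[float]]) -> float:
    """Average dissimilarity between observed and predicted compositions."""
    return sum(bray_curtis(p, predict(z)) for z, p in pairs) / len(pairs)


def uniform_predictor(z: Sequence[int]) -> List[float]:
    """Stand-in predictor: equal abundance for every present species."""
    total = sum(z)
    return [zi / total for zi in z]


data = [([1, 1], [0.5, 0.5]), ([1, 1], [1.0, 0.0])]
print(prediction_error(data, uniform_predictor))
```

The same function gives the training error when applied to the training pairs and the test error when applied to held-out pairs.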
Generating in-silico data for validating cNODE
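We generated in silico data for validating cNODE as steady-state compositions of pools with species and generalized Lotka-Volterra (GLV) population dynamics. The GLV model reads []:
$$\frac{dx_i(t)}{dt} = x_i(t)\Big[r_i + \sum_{j=1}^{N} a_{ij}\, x_j(t)\Big], \quad i = 1, \dots, N,$$
where $x_i(t)$ is the abundance of the $i$th species at time $t$, $r_i$ is its intrinsic growth rate, and $a_{ij}$ is the effect of the $j$th species on the per-capita growth rate of the $i$th species.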
To validate cNODE, we generated synthetic microbiome samples as steady-state compositions of GLV models with random parameters by choosing if , and , for different values of connectivity and characteristic inter-species interaction strength (Supporting Information Note ).
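Generating one synthetic sample can be sketched as below: species in the assemblage start at a small positive abundance, the standard GLV equations $dx_i/dt = x_i (r_i + \sum_j a_{ij} x_j)$ are integrated until the abundances settle, and the result is normalized to a composition. The forward Euler scheme, the step size, and the three-species parameter values are assumptions chosen for illustration; they are not the random parameter distributions used in the study.

```python
from typing import List, Sequence


def glv_steady_state_composition(z: Sequence[int], r: Sequence[float],
                                 A: Sequence[Sequence[float]],
                                 x0: float = 0.1, dt: float = 0.01,
                                 steps: int = 5000) -> List[float]:
    """Integrate GLV dynamics for assemblage z and return the composition.

    Absent species start (and therefore remain) at zero abundance.
    """
    if sum(z) == 0:
        raise ValueError("assemblage must contain at least one species")
    n = len(z)
    x = [x0 * zi for zi in z]
    for _ in range(steps):
        growth = [r[i] + sum(A[i][j] * x[j] for j in range(n))
                  for i in range(n)]
        x = [xi + dt * xi * gi for xi, gi in zip(x, growth)]
    total = sum(x)
    return [xi / total for xi in x]


# Hypothetical parameters: self-limitation on the diagonal, weak competition.
r = [1.0, 0.8, 1.2]
A = [[-1.0, -0.2, 0.0],
     [-0.1, -1.0, -0.3],
     [0.0, -0.2, -1.0]]
print(glv_steady_state_composition([1, 1, 0], r, A))
```

Repeating this for many random assemblages yields a synthetic data set of assemblage/composition pairs.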
Generating in silico data to test the robustness of cNODE
In the first validation, we generated datasets that violate the assumption of universal dynamics. For this, given a “base” GLV model with parameters , we consider two forms of universality loss (Supporting Information Note ). First, samples are generated using a GLV with the same ecological network but with those non-zero interaction strengths replaced by , where characterizes the changes in the typical interaction strength. Second, samples are generated using a GLV with slightly different ecological networks obtained by randomly rewiring a proportion of their edges.
In the second validation, we evaluated the robustness of cNODE against measurement noise in the relative abundance of species. For this, for each sample , we first change the relative abundance of the th species from to , where characterizes the measurement noise intensity. Then, we normalize the vector to ensure it is still compositional, that is, . Due to the measurement noise, some species that were absent may be measured as present and vice versa.
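The perturb-then-renormalize procedure can be sketched as follows. The exact noise model is an assumption here: the sketch adds zero-mean Gaussian noise scaled by an intensity `eps` to every entry and clips negative values to zero, which allows absent species to appear and rare species to disappear. The function name and parameter names are hypothetical.

```python
import random
from typing import List, Sequence


def add_measurement_noise(p: Sequence[float], eps: float,
                          rng: random.Random) -> List[float]:
    """Perturb a composition with additive noise and renormalize.

    Each entry becomes max(0, p_i + eps * N(0, 1)); the vector is then
    divided by its sum so that it is compositional again.
    """
    noisy = [max(0.0, pi + eps * rng.gauss(0.0, 1.0)) for pi in p]
    total = sum(noisy)
    if total == 0.0:
        # Degenerate draw in which every entry was clipped: keep the original.
        return list(p)
    return [x / total for x in noisy]


rng = random.Random(0)  # fixed seed so the example is reproducible
noisy_sample = add_measurement_noise([0.5, 0.3, 0.2, 0.0], eps=0.05, rng=rng)
print(noisy_sample)
```

The noisy assemblage is then read off from the noisy composition, with a species counted as present whenever its perturbed abundance is positive.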
In the third validation, we generated datasets with true multistability by simulating a population dynamics model with nonlinear functional responses (Supporting Information Note ). For each species collection, these functional responses generate two interior equilibria in different “regimes”: one regime with low biomass, and the other regime with high biomass. We then train cNODE with datasets obtained by choosing a fraction of samples from the first regime, and the rest from the second regime.
Validating cNODE using real microbiome data sets
To validate cNODE, we performed a leave-one-out cross-validation over real microbiome data sets (see descriptions in Supporting Information Note ). For each data set, we measured the prediction error of cNODE using each sample as a test set and the rest of the samples as a training set. We repeated this procedure for different learning rates and mini-batch sizes and selected the hyperparameters that minimized the average prediction error over the samples (see Table ).
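The leave-one-out loop itself is independent of the model being validated and can be sketched as below. The `fit` argument stands for any training routine that returns a predictor; the mean-composition baseline used here is a hypothetical stand-in, not cNODE, and the toy data are invented for the example.

```python
import statistics
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[List[int], List[float]]
Predictor = Callable[[Sequence[int]], List[float]]


def bray_curtis(p: Sequence[float], q: Sequence[float]) -> float:
    return (sum(abs(a - b) for a, b in zip(p, q))
            / sum(a + b for a, b in zip(p, q)))


def fit_mean_baseline(train: Sequence[Pair]) -> Predictor:
    """Stand-in model: mean training composition, masked and renormalized."""
    n = len(train[0][1])
    mean = [sum(p[i] for _, p in train) / len(train) for i in range(n)]

    def predict(z: Sequence[int]) -> List[float]:
        masked = [zi * mi for zi, mi in zip(z, mean)]
        total = sum(masked)
        if total == 0.0:  # present species never seen in training
            return [zi / sum(z) for zi in z]
        return [m / total for m in masked]

    return predict


def leave_one_out_errors(pairs: Sequence[Pair],
                         fit: Callable[[Sequence[Pair]], Predictor]) -> List[float]:
    """Hold out each sample once, train on the rest, record the test error."""
    errors = []
    for k in range(len(pairs)):
        train = list(pairs[:k]) + list(pairs[k + 1:])
        z_test, p_test = pairs[k]
        predict = fit(train)
        errors.append(bray_curtis(p_test, predict(z_test)))
    return errors


toy = [([1, 1, 0], [0.6, 0.4, 0.0]), ([1, 0, 1], [0.7, 0.0, 0.3]),
       ([0, 1, 1], [0.0, 0.5, 0.5]), ([1, 1, 1], [0.4, 0.3, 0.3])]
errors = leave_one_out_errors(toy, fit_mean_baseline)
print(statistics.median(errors))
```

Hyperparameter selection would wrap this loop, repeating it for each candidate setting and keeping the one with the lowest average error.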
ACKNOWLEDGMENTS
Marco Tulio Angulo gratefully acknowledges the financial support from CONACyT A1-S-13909 and PAPIIT 104915, México. Yang-Yu Liu acknowledges the funding support from the National Institutes of Health (R01AI141529, R01HD093761, RF1AG067744, UH3OD023268, U19AI095219, and U01HL089856).
CONFLICT OF INTERESTS
The authors declare that there are no conflicts of interest.
AUTHOR CONTRIBUTIONS
Marco Tulio Angulo and Yang-Yu Liu conceived and designed the project. Sebastian Michel-Mata did the numerical analysis. Sebastian Michel-Mata and Xu-Wen Wang performed the real data analysis. All authors analyzed the results. Marco T. Angulo and Yang-Yu Liu wrote the manuscript. Sebastian Michel-Mata and Xu-Wen Wang edited the manuscript.
DATA AVAILABILITY STATEMENT
The data and code used in this study are available at .
East, Roger. 2013. “Soil Science Comes to Life.” Nature 501: S18. [DOI: https://dx.doi.org/10.1038/501S18a]
© 2022. This work is published under http://creativecommons.org/licenses/by/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Microbes can form complex communities that perform critical functions in maintaining the integrity of their environment or their hosts' wellbeing. Rationally managing these microbial communities requires improving our ability to predict how different species assemblages affect the final species composition of the community. However, making such a prediction remains challenging because of our limited knowledge of the diverse physical, biochemical, and ecological processes governing microbial dynamics. To overcome this challenge, we present a deep learning framework that automatically learns the map between species assemblages and community compositions from training data only, without knowing any of the above processes. First, we systematically validate our framework using synthetic data generated by classical population dynamics models. Then, we apply our framework to data from in vitro and in vivo microbial communities, including ocean and soil microbiota, Drosophila melanogaster gut microbiota, and human gut and oral microbiota. We find that our framework learns to perform accurate out‐of‐sample predictions of complex community compositions from a small number of training samples. Our results demonstrate how deep learning can enable us to understand better and potentially manage complex microbial communities.
1 Department of Ecology and Evolutionary Biology, Princeton University, Princeton, New Jersey, USA
2 Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts, USA
3 CONACyT—Institute of Mathematics, Universidad Nacional Autónoma de México, Juriquilla, Mexico