1. Introduction
Metabolomics, the comprehensive analysis of metabolites in a biological organism or system, provides a functional readout of its physiological state. Metabolites span thousands of compound classes; their structural diversity is far greater than that of biopolymers such as proteins. Many metabolite structures remain to be discovered, in particular for non-model organisms and microbial communities; but even the human metabolome still holds surprises, such as the 2018 discovery of an antiviral metabolite [1]. Non-targeted metabolomics experiments usually rely on mass spectrometry (MS), where tandem MS (MS/MS) is used for structural annotations. This experimental setup can detect thousands of molecules in a biological sample.
Spectral libraries store MS/MS spectra of “known compounds”, meaning that the identity of the compound behind each spectrum is known. Large and comprehensive libraries are needed for annotating MS/MS spectra via library search. Importantly, it is not the number of spectra that drives the amount of information present in a spectral library, but rather the number of compounds. Unfortunately, spectral libraries are not only substantially smaller than molecular structure databases, but they also grow at a much slower pace. The vast majority of MS/MS data in a spectral library are reference measurements of synthetic standards. Again, in the vast majority of cases, those standards are commercial compounds ordered from a vendor. When a “new spectral library” is measured in a lab, it usually contains almost exclusively compounds that are already available in other libraries. Hence, the number of MS/MS spectra is growing, but not the number of compounds; and as noted above, more spectra add little information. The reason is cost: standards become more and more expensive the further we “leave the beaten path” of synthetic chemistry.
In the last decade, the development of computational methods for analyzing high resolution metabolite MS and MS/MS data has gained momentum [2,3,4,5,6,7,8]. Six CASMI (Critical Assessment of Small Molecule Identification) contests were conducted from 2012 to 2022; these are blind competitions evaluating computational methods, in particular for searching in a molecular structure database using a query MS/MS spectrum [9,10,11]. This particular task is a striking example of the differences between metabolomics and shotgun proteomics. Even the simplest approaches for searching a peptide MS/MS spectrum in a structure (that is, peptide sequence) database result in good annotation rates; in contrast, despite more than a decade of research, development of so-called “in silico methods” for searching in molecular structure databases remains a challenging problem in metabolomics.
In the following, we concentrate on metabolomics and the task of metabolite annotation from high-resolution, high mass accuracy MS/MS data. Clearly, our method can also be used in other areas such as environmental research, where compound annotations from MS data are sought.
Here, we introduce
2. Materials and Methods
Establishing the stereochemistry of a compound from fragmentation spectra is highly challenging and beyond the power of in silico methods. Hence, only the two-dimensional structure is considered when evaluating the correctness of an annotation. We consider the identity and connectivity (with bond types) of the atoms but ignore the stereo-configuration of asymmetric centers and double bonds.
Datasets and molecular structure database. Training and test datasets were downloaded from the CASMI 2016 web page at
Molecular structures from the PubChem structure database were downloaded on 16 January 2019 and contain 97,168,905 compounds and 77,153,182 unique covalently bonded structures with mass up to 2000 Da. JSON files containing information on those PubChem records were downloaded on 4 July 2022.
Features and samples. Features were read from the downloaded JSON files. In general, features are extracted using the field names given in Table 1. The number of information sources is the number of key-value pairs in the field
Each sample consists of a CASMI instance (query MS/MS) and a molecular structure candidate. As detailed below, we retrieve candidates from PubChem using the molecular formula of the query compound. Features were determined using the candidate structure; only the CSI:FingerID score feature considers both the candidate structure and the query MS/MS. Samples were labeled “correct” if a candidate’s structure and the correct query structure are identical, and “incorrect” otherwise. The training dataset consists of 250 samples labeled “correct” and 645,880 samples labeled “incorrect”. The test dataset consists of 127 samples labeled “correct” and 244,356 samples labeled “incorrect”.
Neural network architecture and training. For the large model, we used a network architecture consisting of two fully connected hidden layers with 256 neurons each. Both hidden layers use the rectified linear activation function (ReLU). The output layer uses a sigmoid activation function and thus returns the estimated probability of the input belonging to the positive class. For the small model, we used a network architecture consisting of one fully connected hidden layer with 12 neurons. The hidden layer again uses the ReLU activation function. We used the same output layer as in the large architecture, but added 30% dropout [13] between the hidden layer and the output layer. Both architectures were trained with a batch size of 32,000 for 5000 epochs. We used binary cross-entropy as the loss function and Adam as the optimizer [14]. Features were standardized by setting their mean to zero and their standard deviation to one. Training and standardization were carried out using TensorFlow [15].
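As a plausibility check on model size, the parameter counts of both architectures can be computed directly. The input dimension of 8 features is our assumption here, chosen because it reproduces the 68.3 k parameter figure quoted in the Remorse section:

```python
# Parameter counts (weights + biases) of fully connected architectures.
# N_FEATURES = 8 is an assumption: it is the input dimension that
# reproduces the ~68.3 k parameter count reported for the large model.

def dense_params(layer_sizes):
    """Total weights and biases of a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

N_FEATURES = 8  # assumed, see comment above

large = dense_params([N_FEATURES, 256, 256, 1])  # two hidden layers, 256 each
small = dense_params([N_FEATURES, 12, 1])        # one hidden layer, 12 neurons

print(large)  # 68353, i.e. ~68.3 k
print(small)  # 121
```

The small model has roughly 560 times fewer parameters, which is the point made later about generalizability.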
Evaluation. We trained and evaluated both models on the test dataset. For each of the 127 instances (queries), we assume that the correct molecular formula of the compound is known, as explained in [16,17]. Therefore, for each instance, the pool of candidates consists of those PubChem structures with the correct molecular formula. For every candidate compound, we calculated features as described earlier and predicted the probability of that compound being labeled “correct” using our trained model. The candidate compound with the highest probability is then returned as the annotation for that instance.
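The per-instance annotation step above reduces to an argmax over the candidate pool. A minimal sketch, where the candidate names and probabilities are hypothetical placeholders and `probs.get` stands in for the trained model:

```python
# Sketch of the per-instance annotation step: among all candidates sharing
# the query's molecular formula, return the candidate with the highest
# predicted probability of being labeled "correct".
# Candidate names and probabilities are hypothetical placeholders.

def annotate(candidates, predict):
    """candidates: iterable of structures; predict: structure -> probability."""
    return max(candidates, key=predict)

# Stand-in for the trained model: a lookup of predicted probabilities.
probs = {"candidate_A": 0.12, "candidate_B": 0.87, "candidate_C": 0.55}
best = annotate(probs, probs.get)
print(best)  # candidate_B
```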
As noted above, we employ CSI:FingerID, ensuring structure-disjoint cross-validation. For each query compound, we choose the cross-validation CSI:FingerID model such that the compound structure is not in the training data. See [16,17] for details.
Random mass spectra and empty mass spectra. To generate an empty MS/MS spectrum, we inserted a precursor peak with an intensity of 100% into the otherwise empty MS/MS spectrum. For that, we used the m/z of the MS1 peak with the highest intensity in a 10 ppm window around the compound m/z. To generate a random MS/MS spectrum, we first inserted the precursor peak with m/z of M as described above. Then, we added random peak m/z uniformly distributed in . Intensities were drawn uniformly from . Evaluations were performed using the
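The two spectrum generators can be sketched as follows. The exact m/z and intensity intervals are elided above, so the ranges used here (fragment m/z uniform in [50, M], intensity uniform in [0, 100]) are assumptions for illustration only:

```python
import random

# Sketch of the empty and random MS/MS spectra described above.
# The ranges below are ASSUMED for illustration; the text does not
# specify the actual m/z and intensity intervals.

def empty_spectrum(precursor_mz):
    # Only the precursor peak, at 100% intensity.
    return [(precursor_mz, 100.0)]

def random_spectrum(precursor_mz, n_peaks=20, rng=random):
    peaks = empty_spectrum(precursor_mz)
    for _ in range(n_peaks):
        mz = rng.uniform(50.0, precursor_mz)   # assumed m/z range
        intensity = rng.uniform(0.0, 100.0)    # assumed intensity range
        peaks.append((mz, intensity))
    return sorted(peaks)

rng = random.Random(0)  # fixed seed for reproducibility
spec = random_spectrum(273.1, rng=rng)
```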
3. Results
We evaluated
Measuring MS/MS spectra requires a lot of time and effort. We found that we can reach 72.4% correct annotations without MS/MS data, just using the precursor m/z and the metascore (small neural network). This is also possible if we replace our MS/MS spectra with random spectra (71.7% correct annotations) or empty spectra (71.0% correct annotations). These numbers are on par with those of any in silico method that participated in the original CASMI 2016 contest [9]. Hence, we can basically replace all parts of the mass spectrometry instruments responsible for measuring MS/MS data with a simple random number generator, or by a device that generates zeroes. We conjecture that this simple trick may save us a considerable amount of money and time in the future.
4. Remorse
It might not have escaped the reader’s attention that we do not seriously propose using
Early publications [16,18,19] went to great lengths to demonstrate and evaluate not only the power but also the limitations of the presented methods. In contrast, it has lately become a habit to demonstrate that a new method strongly out-competes all others in some evaluations. It is sad to say that those claims are often pulverized later in blind CASMI challenges. One way to achieve the desired outcome is to carefully select our competitors, leaving out those that are hard to beat. Another possibility is to evaluate other methods ourselves but not invest too much time in this part of the evaluation; nobody will mind the inexplicable performance drop of the other method unless we explicitly mention it. A third possibility is to restrict reported results to the one evaluation dataset where the desired outcome is reached. Finally, we may simply present results as if our method were the first to ever deal with the presented problem, and simply ignore the presence of other methods; this is the approach we chose for
Do not evaluate using your training data. For the
In fact, we do not even have to know those numbers from evaluation to suspect that something fishy is going on. We have trained a neural network with 68.3 k parameters on a training dataset containing 250 queries, corresponding to 250 positive samples (see Methods). Even tricks such as regularization, dropout [13], or batch normalization [20]—which we deliberately did not use—could not prevent the resulting model from overfitting to the training data. If the number of training samples is small, then the use of a large “vanilla” deep neural network (also known as a multilayer perceptron) should raise some doubt. In recent years, numerous network architectures were developed that substantially reduce the number of parameters in the model, in order to improve its generalization capabilities: Convolutional neural networks for image recognition are a prominent example. Yet, unless there is a convincing reason why such a network architecture should be applied to the problem at hand, we cannot simply grab a random architecture and hope that it will improve our results! Stating that a particular model architecture is “state-of-the-art” in machine learning does not mean it is reasonable to apply it. Machine learning cannot do magic.
For generalizability, less is often more. As reported above, we also trained a smaller neural network with only one hidden layer of 12 neurons. This implies that we have substantially fewer parameters in our model. In contrast to our claims in the results section, this does not mean that the model is less powerful. Rather, there is a good chance that the smaller model will perform better on any independent data. We deliberately made it even harder for the small model to “learn”, using 30% dropout when training the network [13]. This means that in each training step, 30% of the neurons in the hidden layer are randomly removed before the forward pass and backpropagation update the edge weights. Doing so considerably worsened the performance of the model when trained on the evaluation data (memorizing) but simultaneously improved the performance when trained on the training data and evaluated on the evaluation data (learning). For this more sensible machine learning model, we reach 79.5% correct annotations when training and evaluating on the test dataset, and 76.4% when training on the training data and evaluating on the evaluation data.
Do not evaluate using your training data, reloaded. Unfortunately, the separation of the training and test sets is not as easy as one may think. For an in silico method, saying “we used dataset A for training and dataset B for evaluation” is insufficient. The first thing to notice is that the same spectra are often found in more than one spectral library, and we have to ensure that we do not train on the same data we evaluate on. A prominent example involves the data from the CASMI 2016 contest, which have been repeatedly used for method evaluations. These data were later included in spectral libraries such as GNPS (Global Natural Products Social) and MoNA (MassBank of North America). Hence, a method trained on GNPS and evaluated on CASMI 2016 has seen all the data, and the reported numbers are meaningless for an in silico method.
Yet, even a spectrum-disjoint evaluation is insufficient. If the same compound is measured twice for two libraries, then the resulting spectra will be extremely similar. In fact, library search in public and commercial spectral libraries is based on this observation. Hence, we must ensure that the same compound is never found in both training and test data; otherwise, we are basically evaluating spectral library search, which we already know to work rather well. Unfortunately, the rabbit hole goes deeper. This compound-disjoint evaluation is still insufficient. Some compounds (say, L- and D-threose, or threose and erythrose) may differ only in their stereochemistry. Yet, stereochemistry has only a minor impact on the fragmentation of a compound. Hence, evaluations of in silico methods ignore stereochemistry and evaluate search results at the level of molecular structures. (At present, an in silico method evaluation claiming to differentiate L/D-threose and L/D-erythrose based on MS/MS data must not be taken seriously.) To this end, we have to treat all of those compounds as the same structure and ensure that our evaluation is structure-disjoint. These are the minimum requirements for an in silico evaluation to be reasonable.
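One way to enforce a structure-disjoint split is to group spectra by the first (2D) block of the InChIKey, which encodes the skeleton and ignores stereochemistry. A sketch, using hypothetical InChIKeys as placeholders:

```python
# Sketch of a structure-disjoint train/test split. Grouping by the first
# block of the InChIKey collapses stereoisomers (such as L-/D-threose)
# onto one 2D structure. The InChIKeys below are hypothetical placeholders.

def structure_key(inchikey):
    # The first 14 characters encode the 2D skeleton, without stereochemistry.
    return inchikey.split("-")[0]

def structure_disjoint_split(spectra, test_keys):
    """spectra: list of (inchikey, spectrum_id); test_keys: set of 2D blocks."""
    train = [s for s in spectra if structure_key(s[0]) not in test_keys]
    test = [s for s in spectra if structure_key(s[0]) in test_keys]
    return train, test

spectra = [
    ("AAAAAAAAAAAAAA-UHFFFAOYSA-N", "spec1"),  # same 2D structure ...
    ("AAAAAAAAAAAAAA-REWRITEXSA-N", "spec2"),  # ... different stereoisomer
    ("BBBBBBBBBBBBBB-UHFFFAOYSA-N", "spec3"),
]
train, test = structure_disjoint_split(spectra, {"AAAAAAAAAAAAAA"})
# Both stereoisomers land in the test set; no 2D structure is shared.
assert not {structure_key(k) for k, _ in train} & {structure_key(k) for k, _ in test}
```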
There are situations that require an even stricter separation of training and test data. For example, the second evaluation of
Some methods employ machine learning without clearly stating so. Assume that a method adds the scores of two in silico tools. Clearly, we require a multiplier for one of the scores, to make them comparable. Yet, how do we find this multiplier? In fact, it is not too complicated to come up with a set of feature multipliers for
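The point can be made concrete: picking the multiplier that maximizes correct annotations on some dataset is itself a one-parameter model trained on that dataset. A sketch with hypothetical scores and labels:

```python
# Choosing the multiplier w in "combined = score_a + w * score_b" by
# keeping whichever value maximizes correct annotations on a dataset
# IS machine learning: a one-parameter model trained on that dataset.
# Scores and labels below are hypothetical placeholders.

def correct_at_rank_one(instances, w):
    """instances: list of candidate lists of (score_a, score_b, is_correct)."""
    hits = 0
    for candidates in instances:
        best = max(candidates, key=lambda c: c[0] + w * c[1])
        hits += best[2]
    return hits

instances = [
    [(0.9, 1.0, False), (0.8, 5.0, True)],
    [(0.7, 2.0, True), (0.6, 1.0, False)],
]
# "Tuning" the multiplier: a coarse grid search, i.e., training.
best_w = max((w / 10 for w in range(0, 51)),
             key=lambda w: correct_at_rank_one(instances, w))
print(best_w)  # 0.1: the smallest grid value reaching 2/2 correct
```

If `instances` happens to be the evaluation data, the multiplier has been trained on the evaluation data, whether or not the paper calls it machine learning.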
5. Metascores
Unlike many publications from the last years, the resulting procedures for training and evaluation of
What is a metascore. It is important to understand that in general, “metascores” have nothing to do with “metadata”, except for the prefix. This appears to be a common misconception. Metadata is data about your experiment. It may be the expected mass accuracy of the measurements, the column type used for chromatography, or the type of sample one is looking at. Such information is already used, to a certain extent, by all existing in silico methods. For example, we always have to tell these methods about the expected mass accuracy of our measurements.
In contrast, a “metascore” is simply integrating two or more different scores into one. For example, we may combine the scores of two or more in silico methods, such as MetFrag and CFM-ID in [21]. Second, we may combine in silico methods and retention time prediction [21,22,23,24]. Third, we may combine an in silico method with taxonomic information about the analyzed sample [25]. Numerous other types of metascores can be imagined; all of them have in common that they rely on data or metadata of the sample we are analyzing. Such metascores are not what we will talk about in the rest of this paper as they do not have the intrinsic issues discussed in the following.
Yet, in the field of in silico methods, the term “metascore” is often used with a different type of method in mind. In many cases, this term describes a method that combines an in silico method and “side information” from the structure databases.
In the remainder of this manuscript, we will speak about metascores when in fact, a more precise denomination would be “metascore of the
Blockbuster metabolites. Consider a movie recommendation system that, based on your preferences, suggests new movies to watch. Such recommendation systems are widely-known these days, in particular from streaming services such as Netflix, Disney+, or Amazon Prime. Yet, a recommendation system may completely ignore the user’s preferences and suggest blockbuster movies all the time; it may choose films according to their financial success. Recommendations will sometimes appear very accurate; after all, we all love blockbuster movies, at least statistically speaking. In evaluations, we will not notice that the recommendation system is basically useless. In practice, we will.
Similar to blockbuster movies, we introduce the term “blockbuster metabolite” for a small molecule that is “better” than others. We stress that this term is meant to include compounds of biological interest, including drugs and toxic compounds. How can a compound be “better” than others? Different features have been suggested to differentiate “interesting” from “uninteresting” compounds. This includes citation counts (the number of papers, patents, etc., that contain the compound), production volume, or the number of molecular structure databases a compound is contained in. A metascore will combine the original score that considers the MS/MS data, via some function that also considers, say, the citation count.
A blockbuster metabolite is a compound in the structure database we search in that has a much larger value of the considered feature than all other compounds in the structure database. For example, the compound may have ten thousand citations, whereas most compounds in PubChem have none. In more detail, the compound only needs to have more citations than all compounds it is competing with; in silico methods use the precursor m/z to filter the database for valid candidates, so our compound needs to have a higher citation count than all compounds with a similar mass. The term “blockbuster metabolite” indicates that our metascore may prefer those compounds over all other candidates, independent of the measured MS/MS data.
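The masking effect can be sketched with a toy metascore; the weighting function, candidate scores, and citation counts below are all hypothetical:

```python
import math

# Sketch of how a citation-count metascore lets a "blockbuster" candidate
# overshadow every rival in its precursor-mass window, regardless of how
# well each candidate explains the MS/MS data. The weighting, scores, and
# citation counts are hypothetical placeholders.

def metascore(ms_score, citations, w=1.0):
    # Assumed combination: in silico score plus weighted log citation count.
    return ms_score + w * math.log10(1 + citations)

# Candidates with (near-)identical precursor mass; the first one actually
# fits the measured spectrum best.
candidates = [
    ("correct_compound", 0.95, 2),    # best MS/MS score, 2 citations
    ("blockbuster", 0.40, 10_000),    # poor MS/MS score, 10 k citations
]
ranked = sorted(candidates, key=lambda c: metascore(c[1], c[2]), reverse=True)
print(ranked[0][0])  # the blockbuster wins despite the worse MS/MS score
```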
Reporting metascore annotation rates is futile. Evaluating a metascore method using a spectral library is meaningless and even misleading. In any such evaluation, we report annotation rates on reference data, with the implicit subtext that somewhat similar annotation rates are reached on biological data. Yet, when evaluating a method that is using a metascore, reported annotation rates have no connection whatsoever with rates that you can reach for biological data. This is a general critique of any such evaluation and does not depend on the utilized metascore.
To understand this phenomenon, consider again our
Now, why is it not a good idea to evaluate a metascore on a spectral library, and what can we actually learn from such an evaluation? We can learn that the distribution of compounds in any MS/MS library differs substantially from the distribution of compounds of biological interest, or PubChem compounds. Furthermore, we may learn that it is not very complicated to utilize these differences, to make a metascore shine in evaluation. Yet, the actual number that we record as an “annotation rate”, has no meaning.
For the sake of simplicity, let us focus on citation counts, a feature that has repeatedly been suggested and used for metascores. Similar arguments hold for other features. Figure 1 shows the distribution of citation counts for different structure databases and libraries. We see that citation counts differ substantially between PubChem, a database containing “molecules of biological interest”, and spectral libraries. We see that most compounds from PubChem have very few citations, whereas biomolecules have larger citation counts. Notably, these are relative abundances. Every compound from the biomolecular structure databases is also found in PubChem. Yet, the vast majority of compounds in PubChem are not in the biomolecular structure database, and these compounds usually have very low citation counts. The important point here is that compounds found in a spectral library have many more citations than the average biomolecule, and even more so than the average PubChem molecule. The reasons for that go in both directions. On the one hand, a compound that has many citations is more likely to be included in a spectral library. On the other hand, a compound that is included in a spectral library will be annotated more often in biological samples and, hence, has a much higher chance to be mentioned in a publication. These are just two reasons; we assume there are many more.
Clearly, there may be compounds in a spectral library that are not blockbuster metabolites. For example, CASMI 2016 contains MS/MS data for 4-[(4-hydroxynaphthalen-1-yl)diazenyl]benzenesulfonic acid, which has only a single citation. These compounds may be wrongly annotated by a metascore as soon as we find a blockbuster metabolite with the same mass. However, statistically speaking, such cases are the exception and not the rule. The vast majority of compounds in the CASMI 2016 spectral library are blockbuster metabolites, and using a metascore strongly improves our evaluation results.
Unfortunately, even blind competitions such as CASMI cannot avoid this issue, except by explicitly banning the use of metascores. In CASMI 2016, category #3 explicitly allowed metascores such as
Notably, it is not only the evaluation of a metascore method that is futile. Using annotations from a metascore method to derive statistics about an experimental sample is also futile and even misleading. As discussed, we mostly annotate blockbuster metabolites and miss out on all other compounds present in the sample. Hence, mapping those annotations to metabolic pathways or compound classes to, say, check for expression change, is not advisable. Doing so will tell us something about blockbuster metabolites and how they map to pathways and compound classes, but very little (if anything) about the data.
A method shining in evaluations does not have to be reasonable. We have deliberately chosen outrageous features for
Let us take a look behind the curtain. Why is
Potentially more dangerous are features for which, even at second thought, we cannot find a reasonable explanation of what is going on. Consider the highly counterintuitive melting point feature. This value is present only for compounds that have been analyzed in some labs and is left empty for all others. If we had filled in missing values with some reasonable value (say, the average temperature over all compounds where the value is provided), then this feature would carry little useful information for
Finally, some features are merely red herrings (say, record number modulo 42, structure complexity); we expect that removing them will not have a large impact on the performance of the metascore, if any. Yet, including them will not impair machine learning, either. Be warned that some of the red herring features may be picked up by machine learning for filtering candidates, in a way unpredictable and possibly incomprehensible for the human brain.
Any type of metascore will, to a varying degree, allow us to annotate blockbuster metabolites and only blockbuster metabolites. It does not really matter if our features are “reasonable” or “unreasonable”; machine learning will find its way to deduce “helpful information” from the data.
Other compounds become invisible. Blockbuster metabolites are somewhat similar to what is sometimes said about Hollywood stars. They need all the attention, and they cannot bear rivaling compounds. In detail, blockbuster metabolites will often overshadow other metabolites with the same mass. Irrespective of the actual MS/MS query spectrum that we use for searching, the metascore method will return the blockbuster metabolite. All compounds overshadowed by a blockbuster metabolite become basically invisible. For examples of where this goes wrong, see Table 2.
Notably, an invisible compound may not appear invisible at first glance. For example, “ethyl 4-dimethylaminobenzoate” (74 citations) was incorrectly annotated as “3,4-methylenedioxymethamphetamine” (4.5k citations, also known as “ecstasy”) in our
Unfortunately, overshadowing is not only an issue when reporting annotation statistics. Assume that you want to annotate methyl 2-pyrazinecarboxylate from Table 2 (right) based on the measured query spectrum. Good news: CSI:FingerID was actually able to annotate the correct compound structure. Bad news: If you are using a metascore, you will never get to know. The metascore replaced the correct answer with an incorrect one (namely, 4-nitroaniline), as the incorrect one was more often cited. Again, it is not citations that
The above introduces yet another issue. We might believe that we are searching PubChem when in truth, we are searching in a much smaller database. Using a metascore, we can basically drop all invisible metabolites from the searched structure database without ever changing the search results, regardless of the actual MS/MS query spectrum. Unfortunately, we do not know in which database we are actually searching and, hence, we cannot report this information. This is not good news for reproducibility or good scientific practice. As an example not from CASMI 2016, consider molecular formula C17H21NO4. PubChem contains 8384 compounds with this molecular formula. Cocaine (PubChem ID 446220) has 35,200 citations, many more than any other compound. It is nevertheless unreasonable to assume that every compound in every sample with this molecular formula is cocaine. Cocaine may even mask other highly cited compounds, such as scopolamine (PCID 3000322, 8.9 k citations) or fenoterol (PCID 3343, 1.9 k citations). However, most importantly, more than 8100 structures in PubChem with this molecular formula do not have a single citation and will be practically invisible to the metascore.
Clearly, there are queries for which overshadowing is not an issue. In particular, there are no blockbusters in the database for certain precursor masses. In these cases, the metascore method will behave similar to the original method without a metascore. Yet, this does not disprove the above reasoning: For a large fraction of queries, and particularly for biological queries, blockbusters will overshadow other compounds.
Metabolomics vs. environmental research. So far, we have concentrated on metabolomics and biological data in our argumentation. What about environmental research? There, we have highly reasonable prior information stored in the structure databases: namely, the production volumes of compounds. The rationale behind this is that something that is not produced by mankind should also not be annotated in an environmental sample. This appears very reasonable, but unfortunately, most of the arguments formulated above remain valid. Clearly, this is true for our argumentation regarding the evaluation setup, discussed in the previous section. But also with regard to blockbuster compounds, we find that most arguments still hold. An evaluation of a metascore using a spectral library is futile; production rates are a sensible feature to use, but that does not imply that the resulting metascore is sensible; compounds can mask other compounds with lower production volume; excellent hits can be masked by the metascore; we do not really need MS/MS data but instead, could simply use the monoisotopic mass plus the metascore; we cannot report the structure database we are searching in; and so on.
Targeted vs. untargeted analysis, and the right to refuse to annotate. Studies that employ LC–MS or LC–MS/MS for compound annotation are often referred to as “untargeted” by the authors. This is done to differentiate the instrumental setup from, say, multiple reaction monitoring experiments where we have to choose the compounds to be annotated before we start recording data. Is this nomenclature reasonable? The experimental setup is indeed not targeted to a specific set of compounds.
We argue that this nomenclature is at least misleading, as it mixes up the employed instrumental technology and the actual experiment we are conducting. Data analysis must be seen as an integral part of any experiment. If we use LC–MS/MS to measure our data, but then limit ourselves to a single compound that we annotate, then this is not an untargeted experiment. Our data would allow us to go untargeted, but if we do not do that, we also should not call it by this name. If we go from one to a thousand compounds, we are still targeting exactly those thousand compounds. Metaphorically speaking, if you build a propeller engine into a car, you will not fly.
Spectral library search is a “mostly untargeted” data analysis as we restrict ourselves to compounds present in a library, but do not impose further prior knowledge. Any compound in the library has the same chance of getting annotated, as long as it fits the data. A massive advantage in comparison to the “list of masses” is that we reserve the right not to annotate certain compounds in the dataset. If the query spectra do not match anything from the spectral library, then we can label those “unknowns”. We are aware that such “unknown” labels are generally considered a pain in metabolomics research [30]. Yet, knowing that we cannot annotate a certain metabolite is definitely better than assigning a bogus annotation!
In silico methods are a logical next step for untargeted data analysis, as we extend our search to all compounds in a more comprehensive molecular structure database. Yet, even the structure database is incomplete and we might miss the correct structure. Certain methods try to overcome these restrictions altogether [31,32,33,34,35,36], and we expect this to be an active field of research in the years to come.
Using a metascore forces our in silico method back into the targeted corset, as invisible compounds can never be found by the metascore. In a way, metascore methods may even be considered “more targeted” than spectral library search, as the latter does not give different priors to compounds. Metascores are, therefore, in fundamental opposition to the central paradigm behind in silico methods. These methods were developed to overcome the issue of incomplete spectral libraries and, hence, to make data analysis “less targeted”. It is a puzzling idea to go more and less targeted at the same time. In addition, we mostly lose the possibility to flag certain compounds as “unknowns”, if we search a large database such as PubChem. As we have argued above, the knowledge of what we can and what we cannot confidently annotate is actually very useful [37].
6. Conclusions
This paper is not the first to collect recommendations on how to evaluate machine learning methods in the life sciences [38,39,40]. It is also not the first to warn about bad evaluations of machine learning models in this area [41]. Notably, Quinn [42] found that a striking 88% of gut microbiome studies violated established evaluation standards from machine learning. In fact, related warnings were published many decades ago, see for example Ransohoff and Feinstein [43] and Dreyfus and Dreyfus [44]. This paper is not the first to warn about metascores and blockbuster metabolites [45], or that more complex machine learning models may perform worse in applications [46]. See also Chapter 12 in [47]. It is not even the first to present its warnings in the form of satire [48]. Yet, by presenting
We believe that editors and reviewers are responsible for ending the “reproducibility crisis” [41] in which reported annotation rates of in silico methods do not correlate well with annotation rates observed in applications. If a computational method is novel, creative, and imaginative, there is no need for it to outperform all existing methods to be publishable. Instead, creating such an artificial barrier forces evaluations in this direction, and discourages authors from evaluating and discussing the limitations of their method. A novel method may well spawn new developments that we cannot even imagine today. Yet, to judge whether a computational method is novel, creative, and imaginative, you must be an expert in computational methods development. We urge editors to choose qualified reviewers, and we urge reviewers to turn down reviewing invitations if they do not feel fully qualified.
Advocates of metascores have argued that an in silico method cannot be used to identify a compound anyway; if we just want to know which compounds to prioritize for downstream experiments, we may as well go after blockbuster metabolites first. This is correct, but then, we might as well prioritize compounds by price. This would likely result in a highly similar order of candidates, but would avoid the false impression that the top-ranked candidate is a better explanation of the recorded data. (In fact, “price” is a confounding variable that may explain many of the observed correlations.) Next, we might want to see what the in silico method has to say before going after the blockbuster metabolite annotation. If we manually analyze the data, and even consider ordering synthetic standards to get the metabolite identified, then spending five minutes looking through the result list does not sound excessive. Be warned that using metascore annotations beyond prioritization will inevitably lead to “fake” results: mapping those annotations to pathways or compound classes will tell us something about blockbuster metabolites and their characteristics, but hardly anything about the data and our sample.
In conclusion, we should accept that an in silico method does not have to reach 90% correct annotations to be useful. In silico methods cannot do magic; the authors sincerely doubt that MS/MS data will ever allow for such annotation rates, particularly when searching PubChem. In applications, we can restrict annotations to, say, a thousand “suspects” and use the precursor m/z to annotate them. Doing so, we will wrongly annotate, as one of the suspects, every compound in our sample that merely happens to share one of those masses. Yet, we may decide that this is not a major issue for our analysis. If we decide to measure MS/MS data and want to use an in silico method, we should choose a reasonable structure database to search in. PubChem is a wonderful choice for evaluating and comparing methods, but usually a lousy choice in practice. If certain queries cannot be annotated with high confidence by searching the smaller structure database [37], we may search those in PubChem. Finally, we may altogether avoid annotating compounds with structures: structure annotations are not required to, say, compare the distribution of compound classes between samples [32].
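The suspect-list strategy can be made concrete in a few lines. The sketch below is our own illustration, not any particular tool: the helper names and the 10 ppm tolerance are chosen for demonstration. It also shows the caveat from the text, namely that isomers sharing a precursor mass are inevitably co-annotated.

```python
# Sketch of suspect-list annotation by precursor m/z alone. Any compound
# whose precursor mass falls within the tolerance window of a suspect mass
# is annotated as that suspect -- including unrelated isomers.

def ppm_window(mass: float, ppm: float = 10.0) -> float:
    """Absolute mass tolerance corresponding to a relative ppm tolerance."""
    return mass * ppm * 1e-6

def annotate_by_mz(query_mz: float, suspects: dict, ppm: float = 10.0) -> list:
    """All suspect names whose mass matches the query within tolerance."""
    return [name for name, mass in suspects.items()
            if abs(query_mz - mass) <= ppm_window(mass, ppm)]

# Toy suspect list: monoisotopic [M+H]+ masses, for illustration only.
suspects = {
    "caffeine": 195.0877,
    "theophylline": 181.0720,   # isomer of paraxanthine:
    "paraxanthine": 181.0720,   # same formula, hence same mass
}

print(annotate_by_mz(195.0878, suspects))  # unambiguous match
print(annotate_by_mz(181.0721, suspects))  # two isomers match
```

Note that the second query returns both dimethylxanthines; precursor m/z alone cannot distinguish them, which is exactly the wrong-annotation risk discussed above.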
S.B. designed the research. S.B. and M.A.H. developed the
We thank @molumen and openclipart.org for the beautiful Mad Hatter picture in the graphical abstract.
The authors declare no conflict of interest.
Figure 1. Citations per compound for two structure databases (PubChem, biomolecules) and three spectral libraries (GNPS, MoNA, CASMI 2016 training and testing data). The biomolecular structure database is a union of several public structure databases, including HMDB [26], ChEBI [27], KEGG [28], and UNPD [29]. Citation counts are estimated as the number of PubMed IDs associated with the PubChem compound ID, parsed from the file “CID-PMID.gz” provided by PubChem. For the biomolecular structure database, PubChem, GNPS, and MoNA, we uniformly sub-sampled 10,000 compounds each. For every compound, we use the maximum citation count of the corresponding molecular structure (first InChIKey block). We excluded rare cases where the molecular structure was not found in PubChem. Dots are connected by lines solely for visualization purposes.
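The citation-count estimate described in the caption can be sketched as follows. The two-column, tab-separated layout of “CID-PMID.gz” and the helper names are assumptions for illustration; the grouping of CIDs by first InChIKey block is left out, and a tiny generated demo file stands in for the real (very large) mapping.

```python
import gzip
import os
import tempfile
from collections import Counter

def citation_counts(cid_pmid_path: str) -> Counter:
    """Number of associated PubMed IDs per PubChem CID.
    Assumes one 'CID<TAB>PMID' pair per line in the gzipped file."""
    counts = Counter()
    with gzip.open(cid_pmid_path, "rt") as fh:
        for line in fh:
            cid = line.split("\t", 1)[0]
            counts[cid] += 1
    return counts

def structure_citations(cids_of_structure: list, counts: Counter) -> int:
    """Maximum citation count over all CIDs sharing the first InChIKey block."""
    return max((counts[cid] for cid in cids_of_structure), default=0)

# Tiny self-contained demo: CID 1 has two PubMed IDs, CID 2 has one.
demo = os.path.join(tempfile.mkdtemp(), "cid_pmid_demo.gz")
with gzip.open(demo, "wt") as fh:
    fh.write("1\t101\n1\t102\n2\t103\n")

counts = citation_counts(demo)
print(structure_citations(["1", "2"], counts))  # maximum over both CIDs: 2
```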
Table 1. Features used by Mad Hatter.

| Used Field | Feature Description |
|---|---|
| PubChem CID | PubChem CID modulo 42 |
| record name | Record name length |
| information sources, XLogP | Number of information sources × XLogP value, absolute value |
| structure complexity | Structure complexity (numerical value) |
| melting point | Melting point (numerical value) |
| shelf life description | Number of vowels in the shelf life description |
| synonym list | Number of consonants in a compound’s synonym list |
| compound description | Number of occurrences of the word “DNA” in the compound description |
| full record | Number of words starting with the letter ‘U’ (case insensitive) in a compound’s record |
| full record | Number of occurrences of “WAS IT A CAT I SAW” in a compound’s record (case insensitive, blanks ignored) |
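Most of the features in the table are simple record statistics. Three of them can be sketched as follows; the inputs are illustrative stand-ins, since real PubChem records would first need to be fetched and parsed.

```python
# Sketch of three of the (deliberately meaningless) features from the table,
# computed on plain strings rather than real PubChem records.

def cid_modulo_42(cid: int) -> int:
    """PubChem CID modulo 42."""
    return cid % 42

def vowel_count(text: str) -> int:
    """Number of vowels, e.g., in the shelf life description."""
    return sum(1 for c in text.lower() if c in "aeiou")

def cat_i_saw_count(record: str) -> int:
    """Occurrences of the palindrome 'WAS IT A CAT I SAW',
    case insensitive, blanks ignored."""
    squeezed = record.upper().replace(" ", "")
    return squeezed.count("WASITACATISAW")

print(cid_modulo_42(1615))  # 19, matching the table
print(cat_i_saw_count("was it a cat I saw? WAS IT A CAT I SAW!"))  # 2
```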
Table 2. Examples of overshadowing. Left: CASMI 2016 challenge 157. Here, two blockbusters (ethyl 4-dimethylaminobenzoate and 3,4-methylenedioxymethamphetamine) compete with each other, and the bigger blockbuster wins. Right: CASMI 2016 challenge 89. The correct CSI:FingerID answer (methyl 2-pyrazinecarboxylate) is overshadowed by the blockbuster (4-nitroaniline), resulting in an incorrect blockbuster annotation through Mad Hatter.
| | Challenge 157: Correct | Challenge 157: Top Scoring | Challenge 89: Correct | Challenge 89: Top Scoring |
|---|---|---|---|---|
| Compound name | Ethyl 4-dimethylaminobenzoate (a) | 3,4-Methylenedioxymethamphetamine (b) | Methyl 2-pyrazinecarboxylate (c) | 4-Nitroaniline (d) |
| PubChem CID | 25,127 | 1615 | 72,662 | 7475 |
| Number of citations | 74 | 4544 | 5 | 452 |
| PubChem CID modulo 42 | 11 | 19 | 2 | 41 |
| Record name length | 29 | 33 | 28 | 14 |
| No. inf. sources × \|XLogP\| | 162.4 | 160.6 | 6 | 133 |
| Structure complexity | 184 | 186 | 127 | 124 |
| Melting point (in Kelvin) | 0 | 0 | 0 | 420.93 |
| No. vowels in shelf life | 0 | 30 | 0 | 0 |
| No. consonants in synonyms | 535 | 719 | 466 | 750 |
| No. occurrences of “DNA” | 2 | 8 | 0 | 2 |
| No. words starting with ‘U’ | 153 | 377 | 84 | 403 |
| No. “WAS IT A CAT I SAW” | 136 | 420 | 86 | 455 |
| CSI:FingerID score | −189.63 | −216.47 | −80.90 | −194.67 |
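The overshadowing effect can be reproduced with a toy metascore. The additive form and the weight below are entirely hypothetical, not the scoring of any published method; the point is only that any sufficiently citation-heavy weighting flips the ranking of challenge 89.

```python
import math

def metascore(csi_score: float, citations: int, weight: float) -> float:
    """Toy metascore: in silico score plus a log-citation bonus
    (hypothetical form, chosen only to illustrate overshadowing)."""
    return csi_score + weight * math.log10(citations + 1)

# CASMI challenge 89, values from the table above.
csi_correct, cites_correct = -80.90, 5    # methyl 2-pyrazinecarboxylate
csi_block, cites_block = -194.67, 452     # 4-nitroaniline

# Without a citation bonus, the correct candidate ranks first ...
assert metascore(csi_correct, cites_correct, 0) > metascore(csi_block, cites_block, 0)
# ... but a citation-heavy weighting lets the blockbuster overtake it.
assert metascore(csi_block, cites_block, 80) > metascore(csi_correct, cites_correct, 80)
```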
References
1. Gizzi, A.S.; Grove, T.L.; Arnold, J.J.; Jose, J.; Jangra, R.K.; Garforth, S.J.; Du, Q.; Cahill, S.M.; Dulyaninova, N.G.; Love, J.D. et al. A naturally occurring antiviral ribonucleotide encoded by the human genome. Nature; 2018; 558, pp. 610-614. [DOI: https://dx.doi.org/10.1038/s41586-018-0238-4]
2. Petrick, L.M.; Shomron, N. AI/ML-driven advances in untargeted metabolomics and exposomics for biomedical applications. Cell Rep. Phys. Sci.; 2022; 3, 100978. [DOI: https://dx.doi.org/10.1016/j.xcrp.2022.100978] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35936554]
3. Krettler, C.A.; Thallinger, G.G. A map of mass spectrometry-based in silico fragmentation prediction and compound identification in metabolomics. Brief Bioinform.; 2021; 22, bbab073. [DOI: https://dx.doi.org/10.1093/bib/bbab073] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33758925]
4. O’Shea, K.; Misra, B.B. Software tools, databases and resources in metabolomics: Updates from 2018 to 2019. Metabolomics; 2020; 16, 36. [DOI: https://dx.doi.org/10.1007/s11306-020-01657-3]
5. Blaženović, I.; Kind, T.; Ji, J.; Fiehn, O. Software Tools and Approaches for Compound Identification of LC-MS/MS Data in Metabolomics. Metabolites; 2018; 8, 31. [DOI: https://dx.doi.org/10.3390/metabo8020031] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29748461]
6. Hufsky, F.; Böcker, S. Mining molecular structure databases: Identification of small molecules based on fragmentation mass spectrometry data. Mass. Spectrom. Rev.; 2017; 36, pp. 624-633. [DOI: https://dx.doi.org/10.1002/mas.21489]
7. Hufsky, F.; Scheubert, K.; Böcker, S. New kids on the block: Novel informatics methods for natural product discovery. Nat. Prod. Rep.; 2014; 31, pp. 807-817. [DOI: https://dx.doi.org/10.1039/c3np70101h] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/24752343]
8. Scheubert, K.; Hufsky, F.; Böcker, S. Computational Mass Spectrometry for Small Molecules. J. Cheminform.; 2013; 5, 12. [DOI: https://dx.doi.org/10.1186/1758-2946-5-12]
9. Schymanski, E.L.; Ruttkies, C.; Krauss, M.; Brouard, C.; Kind, T.; Dührkop, K.; Allen, F.R.; Vaniya, A.; Verdegem, D.; Böcker, S. et al. Critical Assessment of Small Molecule Identification 2016: Automated Methods. J. Cheminform.; 2017; 9, 22. [DOI: https://dx.doi.org/10.1186/s13321-017-0207-1]
10. Nikolić, D.; Jones, M.; Sumner, L.; Dunn, W. CASMI 2014: Challenges, Solutions and Results. Curr. Metabolomics; 2017; 36, pp. 624-633. [DOI: https://dx.doi.org/10.2174/2213235X04666160617113437]
11. Nishioka, T.; Kasama, T.; Kinumi, T.; Makabe, H.; Matsuda, F.; Miura, D.; Miyashita, M.; Nakamura, T.; Tanaka, K.; Yamamoto, A. Winners of CASMI2013: Automated Tools and Challenge Data. Mass. Spectrom.; 2014; 3, S0039. [DOI: https://dx.doi.org/10.5702/massspectrometry.S0039]
12. Kim, S.; Thiessen, P.A.; Bolton, E.E.; Chen, J.; Fu, G.; Gindulyte, A.; Han, L.; He, J.; He, S.; Shoemaker, B.A. et al. PubChem Substance and Compound databases. Nucleic Acids Res.; 2016; 44, pp. D1202-D1213. [DOI: https://dx.doi.org/10.1093/nar/gkv951] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26400175]
13. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res.; 2014; 15, pp. 1929-1958.
14. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv; 2015; arXiv: 1412.6980
15. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M. et al. TensorFlow: A system for large-scale machine learning. Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI 2016); Savannah, GA, USA, 2–4 November 2016; pp. 265-283.
16. Dührkop, K.; Shen, H.; Meusel, M.; Rousu, J.; Böcker, S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc. Natl. Acad. Sci. USA; 2015; 112, pp. 12580-12585. [DOI: https://dx.doi.org/10.1073/pnas.1509788112]
17. Dührkop, K.; Fleischauer, M.; Ludwig, M.; Aksenov, A.A.; Melnik, A.V.; Meusel, M.; Dorrestein, P.C.; Rousu, J.; Böcker, S. SIRIUS 4: A rapid tool for turning tandem mass spectra into metabolite structure information. Nat. Methods; 2019; 16, pp. 299-302. [DOI: https://dx.doi.org/10.1038/s41592-019-0344-8]
18. Gerlich, M.; Neumann, S. MetFusion: Integration of compound identification strategies. J. Mass. Spectrom.; 2013; 48, pp. 291-298. [DOI: https://dx.doi.org/10.1002/jms.3123]
19. Allen, F.; Greiner, R.; Wishart, D. Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics; 2015; 11, pp. 98-110. [DOI: https://dx.doi.org/10.1007/s11306-014-0676-4]
20. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the International Conference on Machine Learning (ICML 2015); Lille, France, 6–11 July 2015.
21. Ruttkies, C.; Schymanski, E.L.; Wolf, S.; Hollender, J.; Neumann, S. MetFrag relaunched: Incorporating strategies beyond in silico fragmentation. J. Cheminform.; 2016; 8, 3. [DOI: https://dx.doi.org/10.1186/s13321-016-0115-9] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26834843]
22. Menikarachchi, L.C.; Cawley, S.; Hill, D.W.; Hall, L.M.; Hall, L.; Lai, S.; Wilder, J.; Grant, D.F. MolFind: A Software Package Enabling HPLC/MS-Based Identification of Unknown Chemical Structures. Anal. Chem.; 2012; 84, pp. 9388-9394. [DOI: https://dx.doi.org/10.1021/ac302048x]
23. Bach, E.; Szedmak, S.; Brouard, C.; Böcker, S.; Rousu, J. Liquid-Chromatography Retention Order Prediction for Metabolite Identification. Bioinformatics; 2018; 34, pp. i875-i883. [DOI: https://dx.doi.org/10.1093/bioinformatics/bty590]
24. Bach, E.; Rogers, S.; Williamson, J.; Rousu, J. Probabilistic framework for integration of mass spectrum and retention time information in small molecule identification. Bioinformatics; 2021; 37, pp. 1724-1731. [DOI: https://dx.doi.org/10.1093/bioinformatics/btaa998] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33244585]
25. Rutz, A.; Dounoue-Kubo, M.; Ollivier, S.; Bisson, J.; Bagheri, M.; Saesong, T.; Ebrahimi, S.N.; Ingkaninan, K.; Wolfender, J.L.; Allard, P.M. Taxonomically Informed Scoring Enhances Confidence in Natural Products Annotation. Front. Plant Sci.; 2019; 10, 1329. [DOI: https://dx.doi.org/10.3389/fpls.2019.01329] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31708947]
26. Wishart, D.S.; Feunang, Y.D.; Marcu, A.; Guo, A.C.; Liang, K.; Vázquez-Fresno, R.; Sajed, T.; Johnson, D.; Li, C.; Karu, N. et al. HMDB 4.0: The human metabolome database for 2018. Nucleic Acids Res.; 2018; 46, pp. D608-D617. [DOI: https://dx.doi.org/10.1093/nar/gkx1089]
27. Hastings, J.; de Matos, P.; Dekker, A.; Ennis, M.; Harsha, B.; Kale, N.; Muthukrishnan, V.; Owen, G.; Turner, S.; Williams, M. et al. The ChEBI reference database and ontology for biologically relevant chemistry: Enhancements for 2013. Nucleic Acids Res.; 2013; 41, pp. D456-D463. [DOI: https://dx.doi.org/10.1093/nar/gks1146] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/23180789]
28. Kanehisa, M.; Sato, Y.; Kawashima, M.; Furumichi, M.; Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res.; 2016; 44, pp. D457-D462. [DOI: https://dx.doi.org/10.1093/nar/gkv1070] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26476454]
29. Gu, J.; Gui, Y.; Chen, L.; Yuan, G.; Lu, H.Z.; Xu, X. Use of natural products as chemical library for drug discovery and network pharmacology. PLoS ONE; 2013; 8, e62839. [DOI: https://dx.doi.org/10.1371/journal.pone.0062839]
30. da Silva, R.R.; Dorrestein, P.C.; Quinn, R.A. Illuminating the dark matter in metabolomics. Proc. Natl. Acad. Sci. USA; 2015; 112, pp. 12549-12550. [DOI: https://dx.doi.org/10.1073/pnas.1516878112]
31. van der Hooft, J.J.J.; Wandy, J.; Barrett, M.P.; Burgess, K.E.V.; Rogers, S. Topic modeling for untargeted substructure exploration in metabolomics. Proc. Natl. Acad. Sci. USA; 2016; 113, pp. 13738-13743. [DOI: https://dx.doi.org/10.1073/pnas.1608041113]
32. Dührkop, K.; Nothias, L.F.; Fleischauer, M.; Reher, R.; Ludwig, M.; Hoffmann, M.A.; Petras, D.; Gerwick, W.H.; Rousu, J.; Dorrestein, P.C. et al. Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra. Nat. Biotechnol.; 2021; 39, pp. 462-471. [DOI: https://dx.doi.org/10.1038/s41587-020-0740-8]
33. Litsa, E.; Chenthamarakshan, V.; Das, P.; Kavraki, L. Spec2Mol: An end-to-end deep learning framework for translating MS/MS Spectra to de-novo molecules. ChemRxiv; 2021; [DOI: https://dx.doi.org/10.33774/chemrxiv-2021-6rdh6]
34. Kutuzova, S.; Krause, O.; McCloskey, D.; Nielsen, M.; Igel, C. Multimodal variational autoencoders for semi-supervised learning: In defense of product-of-experts. arXiv; 2021; arXiv: 2101.07240
35. Shrivastava, A.D.; Swainston, N.; Samanta, S.; Roberts, I.; Wright Muelas, M.; Kell, D.B. MassGenie: A Transformer-Based Deep Learning Method for Identifying Small Molecules from Their Mass Spectra. Biomolecules; 2021; 11, 1793. [DOI: https://dx.doi.org/10.3390/biom11121793] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34944436]
36. Stravs, M.A.; Dührkop, K.; Böcker, S.; Zamboni, N. MSNovelist: De Novo Structure Generation from Mass Spectra. Nat. Methods; 2022; 19, pp. 865-870. [DOI: https://dx.doi.org/10.1038/s41592-022-01486-3] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35637304]
37. Hoffmann, M.A.; Nothias, L.F.; Ludwig, M.; Fleischauer, M.; Gentry, E.C.; Witting, M.; Dorrestein, P.C.; Dührkop, K.; Böcker, S. High-confidence structural annotation of metabolites absent from spectral libraries. Nat. Biotechnol.; 2022; 40, pp. 411-421. [DOI: https://dx.doi.org/10.1038/s41587-021-01045-9] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34650271]
38. Chicco, D. Ten quick tips for machine learning in computational biology. BioData Min.; 2017; 10, 35. [DOI: https://dx.doi.org/10.1186/s13040-017-0155-3]
39. Walsh, I.; Fishman, D.; Garcia-Gasulla, D.; Titma, T.; Pollastri, G. ELIXIR Machine Learning Focus Group Harrow, J.; Psomopoulos, F.E.; Tosatto, S.C.E. DOME: Recommendations for supervised machine learning validation in biology. Nat. Methods; 2021; 18, pp. 1122-1127. [DOI: https://dx.doi.org/10.1038/s41592-021-01205-4] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34316068]
40. Palmblad, M.; Böcker, S.; Degroeve, S.; Kohlbacher, O.; Käll, L.; Noble, W.S.; Wilhelm, M. Interpretation of the DOME Recommendations for Machine Learning in Proteomics and Metabolomics. J. Proteome Res.; 2022; 21, pp. 1204-1207. [DOI: https://dx.doi.org/10.1021/acs.jproteome.1c00900]
41. Kapoor, S.; Narayanan, A. Leakage and the Reproducibility Crisis in ML-based Science. arXiv; 2022; arXiv: 2207.07048
42. Quinn, T.P. Stool Studies Don’t Pass the Sniff Test: A Systematic Review of Human Gut Microbiome Research Suggests Widespread Misuse of Machine Learning. arXiv; 2021; [DOI: https://dx.doi.org/10.48550/arXiv.2107.03611] arXiv: 2107.03611
43. Ransohoff, D.F.; Feinstein, A.R. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N. Engl. J. Med.; 1978; 299, pp. 926-930. [DOI: https://dx.doi.org/10.1056/NEJM197810262991705]
44. Dreyfus, H.L.; Dreyfus, S.E. What artificial experts can and cannot do. AI Soc.; 1992; 6, pp. 18-26. [DOI: https://dx.doi.org/10.1007/BF02472766]
45. Böcker, S. Searching molecular structure databases using tandem MS data: Are we there yet?. Curr. Opin. Chem. Biol.; 2017; 36, pp. 1-6. [DOI: https://dx.doi.org/10.1016/j.cbpa.2016.12.010] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/28025165]
46. Yaseen, A.; Amin, I.; Akhter, N.; Ben-Hur, A.; Minhas, F. Insights into performance evaluation of compound-protein interaction prediction methods. Bioinformatics; 2022; 38, pp. ii75-ii81. [DOI: https://dx.doi.org/10.1093/bioinformatics/btac496] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36124806]
47. Böcker, S. Algorithmic Mass Spectrometry: From Molecules to Masses and Back Again; Friedrich-Schiller-Universität Jena: Jena, Germany, 2019; Version 0.8.4 Available online: https://bio.informatik.uni-jena.de/textbook-algoms/ (accessed on 29 April 2022).
48. Desaire, H. How (Not) to Generate a Highly Predictive Biomarker Panel Using Machine Learning. J. Proteome Res.; 2022; 21, pp. 2071-2074. [DOI: https://dx.doi.org/10.1021/acs.jproteome.2c00117]
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
Metabolites provide a direct functional signature of cellular state. Untargeted metabolomics usually relies on mass spectrometry, a technology capable of detecting thousands of compounds in a biological sample. Metabolite annotation is executed using tandem mass spectrometry. Spectral library search is far from comprehensive, and numerous compounds remain unannotated. So-called in silico methods allow us to overcome the restrictions of spectral libraries by searching in much larger molecular structure databases. Yet, after more than a decade of method development, in silico methods still do not reach the correct annotation rates that users would wish for. Here, we present a novel computational method called Mad Hatter.