Abstract
Background
Integration of medicinal chemistry data from numerous public resources is an increasingly important part of academic drug discovery and translational research because it brings a wealth of important knowledge about compounds together in one place. However, different data sources can report the same or related compounds in various forms (e.g., tautomers, racemates, etc.), which highlights the need to organise related compounds in hierarchies that alert the user to important bioactivity data that may be relevant. To generate these compound hierarchies, we have developed and implemented canSARchem, a new compound registration and standardization pipeline, as part of the canSAR public knowledgebase. canSARchem builds on previously developed ChEMBL and PubChem pipelines and is developed using KNIME. We describe the pipeline, which we make publicly available, and we provide examples of the strengths and limitations of using hierarchies for bioactivity data exploration. Finally, we identify canonicalization enrichment in FDA-approved drugs, illustrating the benefits of our approach.
Results
We created a chemical registration and standardization pipeline in KNIME and made it freely available to the research community. The pipeline consists of five steps that register the compounds and create the compound hierarchy: (1) structure checker, (2) standardization, (3) generation of canonical tautomers and representative structures, (4) salt strip, and (5) generation of the abstract structure used to build the compound hierarchy. Unlike ChEMBL’s RDKit pipeline, and similar to PubChem’s OpenEye pipeline, we carry out compound canonicalization before deriving the parent structure. canSARchem has a lower rejection rate than both PubChem and ChEMBL. We use our pipeline to assess the impact of grouping compounds in hierarchies for bioactivity data exploration. We find that FDA-approved drugs show statistically significant sensitivity to canonicalization compared to the majority of bioactive compounds, which demonstrates the importance of this step.
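The ordering of the five steps can be illustrated with a minimal Python sketch. This is not the canSARchem KNIME workflow: every function below is a hypothetical, string-based stand-in for a real cheminformatics operation, the tautomer lookup table is a toy assumption, and the longest-fragment salt strip is a naive heuristic. The sketch only shows that canonicalization runs before the salt strip and that compounds sharing the final key fall into the same hierarchy group.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Toy lookup standing in for a real tautomer canonicalizer (assumption):
# 2-hydroxypyridine is mapped to 2-pyridone; which form is "canonical" is arbitrary here.
TOY_TAUTOMERS = {"Oc1ccccn1": "O=c1cccc[nH]1"}


def check(structure: str) -> bool:
    """Step 1 stand-in: reject empty records only."""
    return bool(structure.strip())


def standardize(structure: str) -> str:
    """Step 2 stand-in: trim whitespace (a real step would normalise chemistry)."""
    return structure.strip()


def canonicalize(structure: str) -> str:
    """Step 3 stand-in: map each dot-separated fragment through the toy table."""
    return ".".join(TOY_TAUTOMERS.get(frag, frag) for frag in structure.split("."))


def strip_salt(structure: str) -> str:
    """Step 4 stand-in: keep the longest fragment (naive heuristic)."""
    return max(structure.split("."), key=len)


def abstract(structure: str) -> str:
    """Step 5 stand-in: identity placeholder for the abstract structure."""
    return structure


def register(structure: str) -> Optional[Dict[str, str]]:
    """Run the five steps in the stated order; return None if rejected."""
    if not check(structure):
        return None
    std = standardize(structure)
    canon = canonicalize(std)   # canonicalize BEFORE deriving the parent
    parent = strip_salt(canon)
    return {
        "standardized": std,
        "canonical": canon,
        "parent": parent,
        "abstract": abstract(parent),
    }


def build_hierarchy(compounds: Dict[str, str]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Group compound IDs by their abstract key; collect rejected IDs separately."""
    groups: Dict[str, List[str]] = defaultdict(list)
    rejected: List[str] = []
    for compound_id, structure in compounds.items():
        record = register(structure)
        if record is None:
            rejected.append(compound_id)
        else:
            groups[record["abstract"]].append(compound_id)
    return dict(groups), rejected
```

With these stand-ins, a free tautomer and the hydrochloride salt of its counterpart end up in the same group, which is the kind of link between alternative compound forms that the hierarchy is meant to surface.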
Conclusions
We use canSARchem to standardize all the compounds uploaded to canSAR (> 3 million), enabling efficient data integration and the rapid identification of alternative compound forms with useful bioactivity data. Comparison with the PubChem and ChEMBL pipelines showed comparable performance in compound standardization, but only PubChem and canSAR canonicalize tautomers, and canSAR has a slightly lower rejection rate. Our results highlight the importance of compound hierarchies for bioactivity data exploration. We make canSARchem available under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0) at https://gitlab.icr.ac.uk/cansar-public/compound-registration-pipeline.
Details
1 The Institute of Cancer Research, Department of Data Science, London, UK (GRID:grid.18886.3f); The Institute of Cancer Research, Cancer Research UK Cancer Therapeutics Unit, London, UK (GRID:grid.18886.3f); BenevolentAI, London, UK (GRID:grid.507943.c) (ISNI:0000 0004 7536 1038)
2 The Institute of Cancer Research, Department of Data Science, London, UK (GRID:grid.18886.3f)
3 Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden (GRID:grid.418151.8) (ISNI:0000 0001 1519 6403)
4 The Institute of Cancer Research, Cancer Research UK Cancer Therapeutics Unit, London, UK (GRID:grid.18886.3f); Dunad Therapeutics, Cambridge, UK (GRID:grid.18886.3f)
5 MD Anderson Cancer Center, Houston, USA (GRID:grid.240145.6) (ISNI:0000 0001 2291 4776)
6 The Institute of Cancer Research, Department of Data Science, London, UK (GRID:grid.18886.3f); The Institute of Cancer Research, Cancer Research UK Cancer Therapeutics Unit, London, UK (GRID:grid.18886.3f)