Wójcikowski et al. J Cheminform (2015) 7:26 DOI 10.1186/s13321-015-0078-2
Open Drug Discovery Toolkit (ODDT): a new open-source player in the drug discovery field
Maciej Wójcikowski1, Piotr Zielenkiewicz1,2 and Pawel Siedlecki1,2*
Background
Over the past decades, in silico drug discovery has become an important element augmenting classical medicinal chemistry and high throughput screening. Many novel computational chemistry methods were developed to aid researchers in discovering promising drug candidates. In recent years, much progress has been made in areas such as scoring functions, similarity search methods and statistical approaches (for review see [1, 2]). In contrast to computational chemistry, cheminformatics remains a relatively young field that suffers from many early age diseases, such as lack of standardization, particularly regarding data interchangeability and manipulation and reproducibility of results. To complicate the situation even more, format implementations usually have some additional, non-standard, software-oriented extensions (PDBQT is one prime example). Hardcoding a format into scientific software is also more common than using higher level toolkits, such as OpenBabel [3], RDKit [4], and OpenEye [5].
Some of the most popular and successful methods in drug discovery are structure-based. Structure-based methods are commonly employed to screen large small-molecule datasets, such as online databanks or smaller sets such as tailored combinatorial chemistry libraries. These techniques, from molecular docking to molecular mechanics to ensemble docking, employ scoring processes that are crucial for decision making. Empirical scoring functions use explicit equations based on physical properties of available ligand-receptor complexes.
*Correspondence: [email protected]
1 Institute of Biochemistry and Biophysics PAS, Pawinskiego 5a, 02-106 Warsaw, Poland
Full list of author information is available at the end of the article
© 2015 Wójcikowski et al. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Knowledge-based scoring functions may additionally or exclusively use other types of interaction quantities that are parameterized using training set(s) to fit the data (for review see: [6, 7]). Currently, much effort is directed towards machine learning, which is most helpful in elucidating non-linear and non-trivial correlations in data. NNScore [8], RFscore [9], and SFCscore [10] are among the most distinguished examples. However, there are only a few freely accessible scoring functions and even fewer that are fully open source.
Analyzing output data, particularly when working with large scale virtual screening, can be a tedious and labor-demanding task that is prone to human error. Commercial software facilitates output data analysis to some extent, but there are also open source/free software solutions, such as VSDMIP [11] or DiSCuS [12], which are particularly designed for processing big data. However, the field is still missing a coherent, open source solution that will guide the researcher in building a custom cheminformatics pipeline, tailored for specific project needs. Therefore, we sought to develop a comprehensive open source small-molecule discovery platform for both researchers designing their own pipelines and those developing new drugs. To achieve this goal, we have reviewed state-of-the-art tools and algorithms and united them in one coherent toolkit. When the use of open-source tools was not possible, the algorithms were reimplemented using open source software. This approach will make the in silico discovery process more scalable, cost-effective and easier to customize. We believe that making software open is especially important to ensure data reproducibility and to minimize technology costs. The open-source software model allows numerous individuals to contribute and collaborate, creating opportunities for novel tools and algorithms to be developed.
Implementation
The Open Drug Discovery Toolkit (ODDT) is provided as a Python library to the cheminformatics community. We have implemented many procedures for common and more sophisticated tasks, and below we review the most prominent ones in more detail. We would also like to emphasize that by making the code freely available through a BSD license, we encourage other researchers and software developers to implement more modules, functions and support for their own software.
Molecule formats
Open Drug Discovery Toolkit is designed to support as many formats as possible by extending the use of Cinfony [13]. This common API unites different molecular toolkits, such as RDKit and OpenBabel, and makes interacting with them more Python-like. All atom information collected from the underlying toolkits is stored as Numpy [14] arrays, which provide both speed and flexibility.
Interactions
The toolkit implements the most popular protein-ligand interactions. Directional interactions, such as hydrogen bonds and salt bridges, have additional strict or crude terms that indicate whether the angle parameters are within cutoffs (strict) or only certain distance criteria are met (crude). The complete list of interactions implemented in ODDT consists of hydrogen bonds, salt bridges, hydrophobic contacts, halogen bonds, pi-stacking (face-to-face and edge-to-face), pi-cation, pi-metal and metal coordination. These interactions are detected using in-house functions and procedures utilizing Numpy vectorization for increased performance. Calculated interactions can be used as further (re)scoring terms. Molecular features (e.g., H-acceptors and aromatic rings) are stored as a uniform structure, which enables easy development of custom binding queries.
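The strict/crude distinction described above can be illustrated with a minimal sketch for a single hydrogen bond candidate. This is not ODDT's implementation: the function names, the 3.5 Å distance cutoff, and the 120° ± 30° angle window are assumptions chosen for illustration, and the code uses plain Python instead of Numpy vectorization to stay dependency-free.

```python
import math


def _sub(a, b):
    """Component-wise difference of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


def _norm(v):
    """Euclidean length of a 3D vector."""
    return math.sqrt(sum(x * x for x in v))


def angle_deg(a, b, c):
    """Angle a-b-c in degrees, measured at vertex b."""
    v1, v2 = _sub(a, b), _sub(c, b)
    cos_t = sum(x * y for x, y in zip(v1, v2)) / (_norm(v1) * _norm(v2))
    # Clamp to [-1, 1] to guard against floating-point drift.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def classify_hbond(donor, donor_neighbor, acceptor,
                   dist_cutoff=3.5, ideal_angle=120.0, tolerance=30.0):
    """Return 'strict', 'crude' or None for one donor/acceptor pair.

    All cutoff values are illustrative assumptions, not ODDT defaults.
    """
    if _norm(_sub(donor, acceptor)) > dist_cutoff:
        return None  # distance criterion failed: no interaction
    # Angle at the donor between its bonded neighbour and the acceptor.
    theta = angle_deg(donor_neighbor, donor, acceptor)
    if abs(theta - ideal_angle) <= tolerance:
        return "strict"  # distance and angle criteria both met
    return "crude"  # only the distance criterion is met
```

A contact that passes only the distance test is still reported, but as crude, so downstream (re)scoring terms can weight the two classes differently.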
Filtering
Filtering small molecules by properties is implemented in ODDT. Users can use predefined filters such as RO5 [15], RO3 [16] and PAINS [17]. It is also possible to apply project-specific criteria for MW, LOGP and other parameters listed in the toolkit documentation. See Example 1 in the Results and discussion section for more details on how to use filtering.
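As a rough illustration of property-based filtering, the sketch below applies the commonly cited rule-of-five thresholds (molecular weight ≤ 500, logP ≤ 5, at most 5 H-bond donors and 10 H-bond acceptors) to precomputed descriptors. The dictionary keys, function names, and the two example molecules are hypothetical; ODDT's own filter interface is described in its documentation and is not reproduced here.

```python
# Upper limits for the rule of five; keys are hypothetical descriptor names.
RO5_LIMITS = {
    "molwt": 500.0,  # molecular weight in Da
    "logp": 5.0,     # octanol-water partition coefficient
    "hbd": 5,        # hydrogen bond donors
    "hba": 10,       # hydrogen bond acceptors
}


def count_violations(props, limits=RO5_LIMITS):
    """Count how many properties exceed their upper limit."""
    return sum(1 for key, limit in limits.items() if props[key] > limit)


def apply_filter(molecules, limits=RO5_LIMITS, max_violations=0):
    """Keep molecules with at most `max_violations` exceeded limits."""
    return [m for m in molecules
            if count_violations(m, limits) <= max_violations]


# Made-up descriptor values, for illustration only.
library = [
    {"name": "mol_a", "molwt": 180.2, "logp": 1.2, "hbd": 1, "hba": 4},
    {"name": "mol_b", "molwt": 712.9, "logp": 7.4, "hbd": 2, "hba": 9},
]

drug_like = apply_filter(library)
# Project-specific criteria are just a different limits dictionary.
fragment_like = apply_filter(library, limits={"molwt": 300.0, "logp": 3.0})
```

Keeping the limits in a plain dictionary means a predefined filter and a project-specific one share the same code path.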
Docking
Merging free/open source docking programs into a pipeline can be a frustrating experience for many reasons. Some programs, like Autodock [18] and Autodock Vina [19], do not support multiple ligand inputs, whereas other programs output scores to separate files (e.g., GOLD [20]) or even print directly to the console. Additional effort is required for re-scoring output ligand-receptor conformations in other software. Every in silico discovery project is flooded with custom procedures and scripts to share data between programs. The docking stack within ODDT provides an easier path with the use of a common docking API. This API allows retrieving output conformations and their scores from various widely-used docking programs. The docking stack also supports multi-threading virtual screening tasks independently of underlying software, helping to utilize all available computational resources.
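The idea of a common docking API with engine-independent parallelism can be sketched as follows. The class and function names are hypothetical and do not correspond to ODDT's actual classes; the `DummyEngine` is a placeholder that returns an artificial pose and score, where a real wrapper would invoke an external docking program and parse its output.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class DockingEngine(ABC):
    """Hypothetical common interface shared by all engine wrappers."""

    @abstractmethod
    def dock(self, ligand):
        """Return a list of (pose, score) tuples for one ligand."""


class DummyEngine(DockingEngine):
    """Placeholder engine; the pose and score are artificial values."""

    def dock(self, ligand):
        return [(ligand + "_pose1", -float(len(ligand)))]


def screen(engine, ligands, n_workers=4):
    """Dock ligands concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(engine.dock, ligands))


results = screen(DummyEngine(), ["CCO", "c1ccccc1"], n_workers=2)
```

Because `screen` depends only on the `dock` method, the same parallel driver works for any engine, including ones that accept a single ligand per run.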
Scoring
Open Drug Discovery Toolkit provides a Python re-implementation of two machine learning-based functions: NNScore (version 2) and RFscore. The training sets from its original publication were used for the RFscore function [9]. For NNScore, neither the training set nor the training procedure was made available by the authors, other than a brief description [8]. To bring support for NNScore, we used ffnet [21]. The training procedure for NNScore was reimplemented in ODDT and should closely reproduce the resulting ensemble of neural networks. The training data are stored as csv files, which are used to train scoring functions locally. After the initial training procedure, the scoring function objects are stored in pickle files for improved performance.
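The train-once, then load-from-pickle pattern described above might look like the sketch below. The "model" here is deliberately trivial (the mean of a hypothetical `affinity` column) so that the example stays self-contained; the file layout and function names are assumptions, not ODDT's.

```python
import csv
import os
import pickle


def train_mean_model(csv_path):
    """Toy 'training': average the 'affinity' column of a csv file."""
    with open(csv_path, newline="") as handle:
        values = [float(row["affinity"]) for row in csv.DictReader(handle)]
    return {"mean_affinity": sum(values) / len(values)}


def load_or_train(csv_path, pickle_path):
    """Load a cached model if one exists; otherwise train and cache it."""
    if os.path.exists(pickle_path):
        with open(pickle_path, "rb") as handle:
            return pickle.load(handle)
    model = train_mean_model(csv_path)
    with open(pickle_path, "wb") as handle:
        pickle.dump(model, handle)
    return model
```

On the first call the csv file is read and the result is cached; later calls skip training entirely, which is where the performance gain comes from. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.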
Machine learning scoring functions consist of four main building blocks: descriptors, model, training set and test set. ODDT provides a workflow for training new models, with additional support for custom descriptors and custom training and test sets. Such a design allows not only the use of the toolkit to reproduce scores (or reimplement scoring functions) but also enables the researcher to develop their own custom scoring procedures. Finally, if random seeds are defined, the scoring function results in ODDT are fully reproducible.
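The role of a fixed random seed can be shown with a small sketch of a training/test split. The function below is hypothetical and uses only the standard library; it is meant to show why a defined seed makes every downstream result repeatable, not how ODDT partitions its data.

```python
import random


def split_train_test(items, test_fraction=0.25, seed=None):
    """Shuffle items with a seeded generator and split them in two.

    With the same seed, the same split is produced on every run.
    """
    rng = random.Random(seed)  # private generator; global state untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train_a, test_a = split_train_test(range(20), seed=42)
train_b, test_b = split_train_test(range(20), seed=42)
```

Here `train_a == train_b` and `test_a == test_b`; with `seed=None` the split would generally differ between runs, and so would any model trained on it.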
The ability to assess the predictive performance of scoring functions (or scoring procedures) is of utmost importance. ODDT provides various ways to accomplish this task. One approach may use the area under the receiver operating characteristic curve (ROC AUC and semi-log ROC AUC) and the enrichment factor (EF) at a defined percentage. These methods can be applied to every scoring function (and their combinations) when training/test sets or active/inactive sets are supplied. Two other methods to test scoring function performance are internal k-fold and leave one out / leave p out (LOO/LPO) cross-validation, both of which are particularly useful to detect model overfitting. These methods are available in ODDT through the sklearn Python package [22].
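For readers unfamiliar with the two ranking metrics named above, the following is a minimal pure-Python sketch of their standard definitions. It assumes binary labels (1 = active, 0 = inactive) and that a higher score means a more favorable prediction; the function names are illustrative and are not ODDT's or sklearn's.

```python
def enrichment_factor(labels, scores, fraction=0.01):
    """EF: hit rate in the top-ranked fraction over the overall hit rate."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    n_top = max(1, int(round(len(ranked) * fraction)))
    hits_top = sum(label for _, label in ranked[:n_top])
    total_hits = sum(labels)
    return (hits_top / n_top) / (total_hits / len(labels))


def roc_auc(labels, scores):
    """ROC AUC as the probability that a random active outranks a
    random inactive, counting ties as one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC formulation is quadratic in the number of molecules, which is fine for a demonstration; for large screening sets a sort-based implementation such as the one in sklearn is the practical choice.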
Statistical methods
Modeling the relationship between chemical structural descriptors and compound activities provides insight into SAR. Ultimately, such models may predict screening outcomes of novel compounds, guiding future discovery steps. Because some screening data are linear by their nature, simple regressors can be applied to find correlations (e.g., comparative molecular field analysis, CoMFA [23]). We implemented two straightforward regressions that are widely used in cheminformatics, both in ligand and structure-based methods: multiple linear regression and partial least squares regression.
Nonlinear, more complex data are better assessed by machine learning models. Two forms of machine learning models are particularly important in drug discovery: (1) regressors for continuous data, such as IC50 values or inhibition rates, and (2) classifiers applied to multiple bit-wise features or ligands tagged as active/inactive (e.g., NNScore 1.0). ODDT employs sklearn as the main machine learning backend because it has a mature API and good performance. In some cases when neural networks are required, ODDT mimics the sklearn API and instead uses ffnet [21]. The current version of our toolkit provides machine learning models that are widely used in cheminformatics and drug discovery: (1) random forests, (2) support vector machines, and (3) artificial neural networks (single and multilayer). These models have been shown to provide great guidance when assessing protein-ligand complexes in the development and application of various scoring functions [8–10] and in SAR and QSAR (e.g., [24, 25]).
Results and discussion
In this section, we provide examples of ODDT usage with code snippets. Our aim is to illustrate how one can utilize the toolkit for (a) preparing data for an in silico screening procedure, (b) scoring and rescoring protein-ligand complexes, and (c) assessing data quality and the performance of different computational approaches for elucidating statistical correlations.
Example 1: filtering, docking and rescoring workflow
In this code example, the researcher uses ODDT to dock a database of ligands with Autodock Vina and rescore the results with two independent scoring functions. First, he defines how many cores are available for this task (a 0 value will force all resources to be used). Next, a ligand library is loaded and two filtering steps are applied (for weight and solubility, to be consistent with Lipinski's Rule of five [15]). After filtering, the docking engine is specified (Autodock Vina) and its parameters can be defined (here default values are used, and the docking box is centered around a crystal ligand). In this example, the docked ligand conformations are written to a file for future examination. Two scoring functions are applied to the generated ligand-receptor conformations. The re-scoring results are finally written to a generic csv file for further analysis (Figure 1).
Example 2: training and evaluating models for binding affinity datasets
In this example, the researcher is using a PDBbind dataset (ligand-receptor crystal structures along with experimentally-derived binding affinities (log Ki/Kd values)) [26]. She wishes to train various prediction models on these data and then evaluate which model is the best predictor. (This workflow can also be used as a template to test and develop novel scoring functions and create custom, descriptor-based machine learning models.)
forest classifier model is trained using various fingerprints implemented both in RDKit and OpenBabel.
Firstly, molecules for actives, inactives, decoys and marginal actives (treated as inactives for training) are read from SMILES files. Next, a wide range of fingerprints is built for all molecules: OpenBabel: fp1, fp2, MACCS; RDKit: rdkit (default), morgan, layered.
Secondly, a random forest classifier model is fit on all computed fingerprints, and the quality of the trained model is assessed by a correlation coefficient (R). Additionally, trained models are cross-validated to examine overfitting. From such a short analysis, one can conclude that in the presented case, morgan fingerprints yield the best results (R² = 0.99) in classifying active molecules in the benchmarking sets taken from DUD-E (Figure 4).
Conclusion
In this article, we introduce an out-of-the-box solution for building in silico screening and data elucidation pipelines. The solution is flexible and provides a selection of useful tools, some of which are implemented for the first time. The three workflows illustrated in this paper demonstrate how one can use the toolkit to quickly prepare, filter, and screen data and apply various statistical methods to elucidate relationships.
Availability and requirements
ODDT (Open Drug Discovery Toolkit) is available at https://github.com/oddt/oddt
Operating system(s): platform independent
Programming language: Python
Other requirements:
at least one of the toolkits:
OpenBabel (2.3.2+), RDKit (2012.03)
Python (2.7+)
Numpy (1.6.2+)
Scipy (0.10+)
Sklearn (0.11+)
ffnet (0.7.1+), only for neural network functionality.
License: 3-clause BSD
Any restrictions to use by non-academics: none.
Additional file
Abbreviations
CADD: computer aided drug discovery; ODDT: Open Drug Discovery Toolkit; EF: enrichment factor; ROC: receiver operating characteristic; AUC: area under curve; LOO: leave one out; LPO: leave p out; USR: ultra-fast shape recognition; SAR: structure-activity relationship.
Authors' contributions
MW and PS carried out the design and programming and drafted the manuscript, and PZ revised the manuscript critically for important intellectual content. All authors read and approved the final manuscript.
Author details
1 Institute of Biochemistry and Biophysics PAS, Pawinskiego 5a, 02-106 Warsaw, Poland. 2 Department of Systems Biology, Institute of Experimental Plant Biology and Biotechnology, University of Warsaw, Miecznikowa 1, 02-096 Warsaw, Poland.
Acknowledgements
This work was supported by the Polish Ministry of Science and Higher Education (Grant No. IP2010 037470 and POIG.02.03.00-00-003/09-00) and The National Centre for Research and Development (Grant No. PBS1/A7/9/2012).
Compliance with ethical guidelines
Competing interests
The authors declare that they have no competing interests.
Received: 5 December 2014 Accepted: 21 May 2015
References
1. Vogt M, Bajorath J (2012) Chemoinformatics: a view of the field and current trends in method development. Bioorg Med Chem 20:5317–5323
2. Duffy BC, Zhu L, Decornez H, Kitchen DB (2012) Early phase drug discovery: cheminformatics and computational techniques in identifying lead series. Bioorg Med Chem 20:5324–5342
3. O'Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR (2011) Open Babel: an open chemical toolbox. J Cheminform 3:33
4. RDKit: Cheminformatics and Machine Learning Software (2013). http://www.rdkit.org. Accessed 31 Nov 2014
5. OpenEye Scientific Software, Santa Fe, NM, USA. http://www.eyesopen.com. Accessed 31 Nov 2014
6. Jain AN (2006) Scoring functions for protein-ligand docking. Curr Protein Pept Sci 7:407–420
7. Cheng T, Li X, Li Y, Liu Z, Wang R (2009) Comparative assessment of scoring functions on a diverse test set. J Chem Inf Model 49:1079–1093
8. Durrant JD, McCammon JA (2011) NNScore 2.0: a neural-network receptor-ligand scoring function. J Chem Inf Model 51:2897–2903
9. Ballester PJ, Mitchell JBO (2010) A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinf Oxf Engl 26:1169–1175
10. Zilian D, Sotriffer CA (2013) SFCscore(RF): a random forest-based scoring function for improved affinity prediction of protein-ligand complexes. J Chem Inf Model 53:1923–1933
11. Cabrera C, Gil-Redondo R, Perona A, Gago F, Morreale A (2011) VSDMIP 1.5: an automated structure- and ligand-based virtual screening platform with a PyMOL graphical user interface. J Comput Aided Mol Des 25:813–824
12. Wójcikowski M, Zielenkiewicz P, Siedlecki P (2014) DiSCuS: an open platform for (not only) virtual screening results management. J Chem Inf Model 54:347–354
13. O'Boyle NM, Hutchison GR (2008) Cinfony – combining Open Source cheminformatics toolkits behind a common interface. Chem Cent J 2:24
14. van der Walt S, Colbert SC, Varoquaux G (2011) The NumPy Array: a structure for efficient numerical computation. Comput Sci Eng 13:22–30
15. Lipinski CA (2004) Lead- and drug-like compounds: the rule-of-five revolution. Drug Discov Today Technol 1:337–341
16. Congreve M, Carr R, Murray C, Jhoti H (2003) A rule of three for fragment-based lead discovery? Drug Discov Today 8:876–877
17. Baell JB, Holloway GA (2010) New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays. J Med Chem 53:2719–2740
18. Morris GM, Huey R, Lindstrom W, Sanner MF, Belew RK, Goodsell DS et al (2009) AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility. J Comput Chem 30:2785–2791
19. Trott O, Olson AJ (2010) AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem 31:455–461
20. Jones G, Willett P, Glen RC (1995) Molecular recognition of receptor sites using a genetic algorithm with a description of desolvation. J Mol Biol 245:43–53
21. Wojciechowski M (2007) FFNET: feed-forward neural network for Python. Tech Univ Lodz Pol Lodz Pol
22. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
23. Cramer RD, Patterson DE, Bunce JD (1988) Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J Am Chem Soc 110:5959–5967
24. Schneider G, Downs G (2003) Editorial: machine learning methods in QSAR modelling. QSAR Comb Sci 22:485–486
25. Niu B, Lu W, Yang S, Cai Y, Li G (2007) Support vector machine for SAR/QSAR of phenethyl-amines. Acta Pharmacol Sin 28:1075–1086
26. Liu Z, Li Y, Han L, Li J, Liu J, Zhao Z, Nie W, Liu Y, Wang R (2015) PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics 31:405–412
Journal of Cheminformatics is a copyright of Springer, 2015.
Abstract
Background
There has been huge progress in the open cheminformatics field in both methods and software development. Unfortunately, there has been little effort to unite those methods and software into one package. We here describe the Open Drug Discovery Toolkit (ODDT), which aims to fulfill the need for comprehensive and open source drug discovery software.
Results
The Open Drug Discovery Toolkit was developed as a free and open source tool for both computer aided drug discovery (CADD) developers and researchers. ODDT reimplements many state-of-the-art methods, such as machine learning scoring functions (RF-Score and NNScore) and wraps other external software to ease the process of developing CADD pipelines. ODDT is an out-of-the-box solution designed to be easily customizable and extensible. Therefore, users are strongly encouraged to extend it and develop new methods. We here present three use cases for ODDT in common tasks in computer-aided drug discovery.
Conclusion
Open Drug Discovery Toolkit is released on a permissive 3-clause BSD license for both academic and industrial use. ODDT's source code, additional examples and documentation are available on GitHub (https://github.com/oddt/oddt).