Abstract

Machine learning potentials are an important tool for molecular simulation, but their development is held back by a shortage of high-quality datasets to train them on. We describe the SPICE dataset, a new quantum chemistry dataset for training potentials relevant to simulating drug-like small molecules interacting with proteins. It contains over 1.1 million conformations for a diverse set of small molecules, dimers, dipeptides, and solvated amino acids. It includes 15 elements, charged and uncharged molecules, and a wide range of covalent and non-covalent interactions. It provides both forces and energies calculated at the ωB97M-D3(BJ)/def2-TZVPPD level of theory, along with other useful quantities such as multipole moments and bond orders. We train a set of machine learning potentials on it and demonstrate that they can achieve chemical accuracy across a broad region of chemical space. The dataset can serve as a valuable resource for the creation of transferable, ready-to-use potential functions for molecular simulations.

Details

Title
SPICE, A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials
Author
Eastman, Peter 1; Behara, Pavan Kumar 2; Dotson, David L. 3; Galvelis, Raimondas 4; Herr, John E. 5; Horton, Josh T. 6; Mao, Yuezhi 1; Chodera, John D. 7; Pritchard, Benjamin P. 8; Wang, Yuanqing 9; De Fabritiis, Gianni 10; Markland, Thomas E. 1

1 Stanford University, Department of Chemistry, Stanford, USA (GRID:grid.168010.e) (ISNI:0000 0004 1936 8956)
2 University of California, Department of Pharmaceutical Sciences, Irvine, USA (GRID:grid.266093.8) (ISNI:0000 0001 0668 7243)
3 Open Molecular Software Foundation, The Open Force Field Initiative, Davis, USA (GRID:grid.266093.8)
4 Acellera Labs, Barcelona, Spain (GRID:grid.266093.8)
5 University of Notre Dame, Department of Chemistry and Biochemistry, Notre Dame, USA (GRID:grid.131063.6) (ISNI:0000 0001 2168 0066)
6 Newcastle University, School of Natural and Environmental Sciences, Newcastle upon Tyne, United Kingdom (GRID:grid.1006.7) (ISNI:0000 0001 0462 7212)
7 Memorial Sloan Kettering Cancer Center, Computational and Systems Biology Program, Sloan Kettering Institute, New York, USA (GRID:grid.51462.34) (ISNI:0000 0001 2171 9952)
8 Virginia Polytechnic Institute and State University, Molecular Sciences Software Institute, Blacksburg, USA (GRID:grid.438526.e) (ISNI:0000 0001 0694 4940)
9 Memorial Sloan Kettering Cancer Center, Computational and Systems Biology Program, Sloan Kettering Institute, New York, USA (GRID:grid.51462.34) (ISNI:0000 0001 2171 9952); Weill Cornell Graduate School of Medical Sciences, Graduate Program in Physiology, Biophysics, and Systems Biology, New York, USA (GRID:grid.5386.8) (ISNI:0000 0004 1936 877X)
10 Acellera Labs, Barcelona, Spain (GRID:grid.5386.8); Computational Science Laboratory, Universitat Pompeu Fabra, Barcelona Biomedical Research Park (PRBB), Carrer Dr. Aiguader 88, 08003, Barcelona, Spain and ICREA, Barcelona, Spain (GRID:grid.5612.0) (ISNI:0000 0001 2172 2676)
Pages
11
Publication year
2023
Publication date
2023
Publisher
Nature Publishing Group
e-ISSN
20524463
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
2760730352
Copyright
© The Author(s) 2022. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.