Towards Compilation of Balanced Protein Stability

Abstract

Protein stability datasets contain neutral mutations that are highly concentrated in a much narrower ΔΔG range than destabilizing and stabilizing mutations. Notwithstanding their high density, often studies analyzing stability datasets and/or predictors ignore the neutral mutations and use a binary classification scheme labeling only destabilizing and stabilizing mutations. Recognizing that highly concentrated neutral mutations would affect the quality of stability datasets, we have explored three protein stability datasets; S2648, PON-tstab and the symmetric S^sym that differ in size and quality. A characteristic leptokurtic shape in the ΔΔG distributions of all three datasets including the curated and symmetric ones were reported due to concentrated neutral mutations. To further investigate the impact of neutral mutations on ΔΔG predictions, we have comprehensively assessed the performance of eleven predictors on the PON-tstab dataset. Correlation and error analyses showed that all of the predictors performed the best on the neutral mutations while their performance became gradually worse as the ΔΔG of the mutations departed further from the neutral zone regardless of the direction, implying a bias towards dense mutations. To this end, after unraveling the role of concentrated neutral mutations in biases of stability datasets, we described a systematic under-sampling approach to balance the ΔΔG distributions. Before under-sampling, mutations were clustered based on their biochemical and/or structural features and then three mutations were systematically selected from every 2 kcal/mol of each cluster. Upon implementation of this approach by distinct clustering schemes, we generated five subsets varying in size and ΔΔG distributions. All subsets notably showed amelioration of not only the shape of ΔΔG distributions but also other pre-existing imbalances in the frequency distributions. We also reported differences in the performance of the predictors between the parent and under-sampled subsets due to the enrichment of previously under-represented mutations in the subsets. Altogether, this study not only elaborated the pivotal role of concentrated mutations in the dataset biases but also contemplated and realized a rational strategy to tackle this and other forms of biases. Under-sampling code is available (https://github.com/narodkebabci/gRoR).

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

* https://github.com/narodkebabci/gRoR

* https://bit.ly/3xNg0tr

Details

Title

Towards Compilation of Balanced Protein Stability Datasets: Flattening the ΔΔG Curve through Systematic Under-sampling

Author

Narod Kebabci; Ahmet Can Timucin; Timucin, Emel

University/institution

Cold Spring Harbor Laboratory Press

Section

New Results

Publication year

2021

Publication date

Sep 20, 2021

Publisher

Cold Spring Harbor Laboratory Press

ISSN

2692-8205

Source type

Working Paper

Language of publication

English

DOI

https://doi.org/10.1101/2021.09.17.460216

ProQuest document ID

2574559279

© 2021. This article is published under http://creativecommons.org/licenses/by/4.0/ (“the License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Towards Compilation of Balanced Protein Stability Datasets: Flattening the ΔΔG Curve through Systematic Under-sampling

Jump to:

Abstract

Details

Suggested sources