1. Introduction
The AAindex is a database of numerical indices representing the physicochemical and biochemical properties of amino acids [1]. It is widely applied in bioinformatics, computational biology, and molecular biology (the three original AAindex manuscripts have been cited more than 2500 times in total). Specifically, it is used in the following research activities:
Studies of protein–protein interactions, by offering the physicochemical properties of amino acids [2,3].
Evolutionary biology, especially in understanding how amino acid substitutions can alter protein function over time [4,5,6,7].
Mutational analysis, by enabling one to understand how point mutations that alter amino acid sequences affect a protein’s properties, stability, or functionality [8,9,10,11,12,13].
Enzyme studies, by modeling the enzyme activity, stability, and specificity based on the amino acid properties, supporting both experimental and theoretical enzyme research [14,15,16,17].
Protein structure prediction, by providing numerical values for amino acid properties such as the hydrophobicity, polarity, or molecular weight, which can be crucial in predicting the secondary, local, and tertiary protein structures [18,19,20,21,22,23].
Drug design and molecular docking, by supplying the amino acid property values used to estimate the binding affinities between proteins and drug molecules (these affinities can be critical to designing molecules that effectively bind or inhibit specific proteins) [24,25,26,27].
Protein function annotation, by comparing the amino acid properties with those of known proteins, facilitating classification based on their physicochemical characteristics [28,29,30,31].
Sequence alignment and homology modeling, by incorporating amino acid substitution matrices into alignment algorithms that reflect the physicochemical differences between amino acids (this can improve the accuracy of sequence homology models that compare proteins according to their functional or structural similarity) [32,33,34,35].
Machine learning and predictive models in proteomics, via the application of physicochemical properties from the AAindex as features in various machine learning models that predict the protein structure, function, folding patterns, or interaction patterns [21,36,37,38].
In addition to theoretical methods for studying protein–protein and DNA–protein interactions, experimental approaches are also available. Under certain circumstances, these interactions can be measured directly using advanced techniques such as bioluminescence resonance energy transfer (BRET) [39,40] and atomic force microscopy (AFM) [41]. These methods have the distinct advantage of directly measuring the specific molecular complexes under study. However, the data obtained through these methods can be difficult to generalize, and they cannot be used directly in theoretical studies or for predicting the structures and properties of proteins and biomolecular complexes. Moreover, these methods cannot be employed for examining compounds that have yet to be synthesized or isolated (for instance, in drug design). Consequently, comprehensive databases of the constituents of biological macromolecules remain necessary for many research areas.
The AAindex database was initially conceived as a source of physicochemical properties for the 20 canonical amino acids, which are encoded by the triplet codons of the genetic code [1]. Over time, it became clear that proteins include a considerable number (more than 1000) of non-canonical amino acids (ncAAs), which play a role in metabolism and take part in signal transduction. In vivo, ncAAs arise from canonical amino acids through post-translational modifications, backbone alterations (β- and γ-amino acids) [42,43], and stereochemical inversions (D-amino acids) [44]. The number of ncAAs is also significantly increased by chemical synthesis [45]. Many ncAAs inherently resist proteolytic degradation, which makes them potential protease inhibitors and research tools for profiling studies in substrate optimization and enzyme inhibition. ncAAs can also be applied in various fields of research, such as pharmacokinetics and peptidomimetics [46], the thermal stabilization of enzymes [47], enzyme kinetics, protease research [44], molecular interactions, bioimaging, structure–function studies (especially ncAAs with bio-orthogonal labels), and even photo-control to switch protein activity ON or OFF [48].
Given the broad application of ncAAs in research, accurate physicochemical data for ncAAs are critical across computational molecular modeling tasks. The incorporation of ncAA-specific parameters into traditional force fields through expansion and reparameterization provides the means for the precise modeling of novel side chain interactions, modified backbone conformations, and unique electronic effects [49,50]. These physicochemical properties are also essential for predicting the protein and peptide structure and function, where the hydrophobicity, charge, and size directly affect the folding, stability, and protein complex interactions [51]. Precise ncAA models enhance the structural motif analysis and protein engineering that is essential for stable protein-based drug designs and designing enzymes with novel functions [52]. Furthermore, accurate ncAA physicochemical data improve molecular dynamics simulations via expanded force fields, thereby enhancing the predictions of protein conformations and interaction sites. Machine learning models can also benefit from incorporating these properties [53].
Since ncAAs are, by definition, non-standard, they are absent from most databases. As a result, the design of proteins incorporating ncAAs, functional predictions, and their overall characterization predominantly depend on data from canonical amino acids. The purpose of this study was to develop an algorithm for the accurate prediction of all the physicochemical properties from the AAindex database and to apply it to all known ncAAs, thereby creating an extension of the AAindex database.
2. Results
We recommend that readers start with the Materials and Methods section (Section 4), as the results presented here rely on the methodology described there.
We developed a method to evaluate the physicochemical properties from the AAindex database for non-canonical amino acids and assessed its prediction accuracy. The method was implemented as a bioinformatics command-line tool, accessible via the website
As described in Section 4.4, the quality of the predictions was evaluated using Pearson’s correlation coefficient between the experimental and predicted values of a specific physicochemical property across the 20 canonical amino acids. A learning model was constructed for each of the 566 physicochemical properties from the AAindex database. The quality of the predictions made by each learning model was evaluated statistically in order to assess their accuracy and to estimate the root mean square error of the predicted property.
A cross-platform command-line bioinformatics tool, written in C++, was developed to predict each of the AAindex’s 566 physicochemical properties from the SMILES encoding of any ncAA, including those not yet known or synthesized, using the generated learning models. This tool was applied to each ncAA obtained from the PDB that was present in the PDBeChem databank and did not contain the elements As, B, Br, Cl, F, I, P, or Se (see Section 4.3). The resulting database of the predicted properties is available at
We also assessed the prediction quality for each physicochemical property from the AAindex database using the approach described in Section 4.4. The 20 best-predicted physicochemical properties from the AAindex database are presented in Table 1. This table provides insight into the reliability of the models developed for each physicochemical property. A higher correlation coefficient obtained during model optimization (in the leave-one-out mode) indicates greater reliability in predicting the given property for ncAAs. Additionally, the root mean square error (RMSE) for the selected property is presented, along with the number of significant predictors included in the model and the Fisher statistic threshold at which the model achieved the maximum correlation coefficient.
The complete version of Table 1, which encompasses all 566 physicochemical properties from the AAindex database, is presented in Supplemental Table S1.
A moderate level of correlation (rj-n > 0.515, p-value < 0.01) was achieved for the prediction of 227 properties, while an average level of correlation (rj-n > 0.378, p-value < 0.05) was obtained for 322 properties. The method we developed performs significantly better in predicting the physicochemical properties that are measured via experiments and less effectively for those that can be obtained through straightforward computation. Notably, 72.7% of the hydrophobicity-related properties were predicted with rj-n > 0.5 (p < 0.012), and 93.9% of these properties were predicted with at least an average level of correlation rj-n > 0.37 (mean rj-n for all hydrophobicity values = 0.621, mean RMSE for all hydrophobicity values = 0.239) (Table 2).
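The correlation thresholds quoted above can be related to their p-values through the standard t-test for Pearson’s correlation coefficient. The following is a sketch that assumes a one-tailed test with n = 20 canonical amino acids (n − 2 = 18 degrees of freedom); the text does not state the type of test, so this is an inference from the quoted numbers.

```latex
\begin{align}
t &= \frac{r\,\sqrt{n-2}}{\sqrt{1-r^{2}}}
\quad\Longleftrightarrow\quad
r_{\mathrm{crit}} = \frac{t_{\mathrm{crit}}}{\sqrt{t_{\mathrm{crit}}^{2} + n - 2}}, \\[4pt]
p = 0.05:\quad t_{\mathrm{crit}} &= 1.734
\;\Rightarrow\;
r_{\mathrm{crit}} = \frac{1.734}{\sqrt{1.734^{2} + 18}} \approx 0.378, \\[4pt]
p = 0.01:\quad t_{\mathrm{crit}} &= 2.552
\;\Rightarrow\;
r_{\mathrm{crit}} = \frac{2.552}{\sqrt{2.552^{2} + 18}} \approx 0.515.
\end{align}
```

Under this assumption, both quoted thresholds (0.378 and 0.515) are reproduced from the one-tailed critical values of Student’s t-distribution with 18 degrees of freedom.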
In order to ascertain which components of an amino acid contribute to a particular physicochemical property, it is possible to analyze which predictors were found to be statistically significant for predicting that property. In the case of the EISD840101 (consensus normalized hydrophobicity scale) [54] property, the most statistically significant predictors were the number of oxygen atoms, the number of positive charges, the number of nitrogen atoms, and the number of carbon atoms in aromatic rings. The learning model for the EISD840101 property is presented in Table 3.
The identification of amino acid components that exert the greatest influence on a given physicochemical property allows for the design, creation, or modification of ncAAs with enhanced or reduced desired properties (e.g., hydrophobicity) through chemical synthesis. Information regarding the significance of specific predictors is typically unavailable in predictions made by neural networks.
Consider, for example, how the EISD840101 property value is predicted for the ncAA HYP (4-hydroxyproline). The OpenEye SMILES for HYP is O[C@H]1CN[C@@H](C1)C(O)=O. Table 4 lists the predictor values corresponding to this SMILES, along with the corresponding regression coefficients and the constant term.
Omitting the zero terms, EISD840101 (HYP) = (−0.563688) × 3 + (−0.556289) × 1 + 0.11307 × 5 + (−0.290756) × 1 + 1.629829 = −0.34293. The RMSE value for this property, calculated for the 20 canonical amino acids, is 0.282 (Supplemental Table S1). When predicting the physicochemical properties of ncAAs, we utilized the RMSE (the most commonly applied error function) to fit the prediction model. So, the predicted value for EISD840101 (HYP) is −0.343 ± 0.282. Similarly, all the physicochemical properties from the AAindex database were calculated for all ncAAs.
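The calculation above can be reproduced with a short script. This is a minimal Python sketch, not the authors’ C++ tool: the coefficients are copied from Table 4, while the counting rules in `count_components` (plain substring counts, and ring labels counted as half the number of occurrences of the ring-closure digit) are assumptions chosen because they reproduce the predictor values listed in Table 4 for HYP. The function names are hypothetical.

```python
# Regression coefficients for EISD840101, copied from Table 4. Components with
# a zero coefficient were not selected as significant predictors.
COEFFS = {
    "=O": 0.0, "O": -0.563688, "N": -0.556289, "n": 0.163592, "=N": 0.452579,
    "C": 0.113070, "c": 0.148830, "[C@": 0.0, "S": -0.248443,
    "c1": -0.290756, "c2": -0.688098, "=": 0.0, "+": -1.168294,
}
CONSTANT = 1.629829
RMSE = 0.282  # RMSE of EISD840101 over the 20 canonical amino acids


def count_components(smiles: str) -> dict:
    """Count the Table 6 components in a SMILES string.

    ASSUMPTION: simple substring counts. This only makes sense for molecules
    without two-letter elements such as Cl or Se, which the study excludes.
    """
    return {
        "=O": smiles.count("=O"),
        "O": smiles.count("O"),
        "N": smiles.count("N"),
        "n": smiles.count("n"),
        "=N": smiles.count("=N"),
        "C": smiles.count("C"),
        "c": smiles.count("c"),
        "[C@": smiles.count("[C@"),
        "S": smiles.count("S"),
        # ASSUMPTION: a ring label appears twice (opening and closing digit).
        "c1": smiles.count("1") // 2,
        "c2": smiles.count("2") // 2,
        "=": smiles.count("="),
        "+": smiles.count("+"),
    }


def predict(smiles: str) -> float:
    """Linear model: constant term plus coefficient-weighted component counts."""
    counts = count_components(smiles)
    return CONSTANT + sum(COEFFS[name] * counts[name] for name in COEFFS)


hyp = "O[C@H]1CN[C@@H](C1)C(O)=O"
value = predict(hyp)
print(f"EISD840101(HYP) = {value:.3f} ± {RMSE}")  # EISD840101(HYP) = -0.343 ± 0.282
```

Because the model is a plain linear combination, the contribution of each component to the predicted value can be read directly from the product of its count and its coefficient.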
Thus, our database comprises a series of correlation coefficients (which can be used as a measure of the prediction quality), regression coefficients, their associated statistical significance (F-values), RMSE values, and a comprehensive list of predictors forming the model for each physicochemical property in the AAindex database. This dataset may be downloaded for local analysis or accessed via a web browser.
3. Discussion
In this study, we introduced a method for predicting the physicochemical properties listed in the AAindex database for non-canonical amino acids.
Some properties, such as FASG760101 (molecular weight) [55] or CHAM830106 (the number of bonds in the longest chain) [56], can be calculated straightforwardly for any molecule based on its chemical composition. The AAindex database also contains many properties derived from the amino acid occurrence analysis (e.g., DAYM780101, the amino acid composition [57]) or the conformational preferences of amino acids (e.g., CHOP780101, the normalized frequency of beta-turns [58]). Predicting such physicochemical properties is unnecessary and of little interest, since they can be computed straightforwardly.
In contrast, the situation is entirely different for the physicochemical properties obtained from experiments. The AAindex database includes a wide variety of scales for hydrophobicity, the energy of transfer between mediums, polarizability, isoelectric points, solvation-related properties, etc. The accurate prediction of these properties can significantly enhance the learning models for proteins containing ncAAs, particularly when these models employ amino acid physicochemical properties from the AAindex database as predictors [18,21].
An example of the need for ncAA physicochemical properties can be seen in our previous work [21], where we encountered challenges in constructing an adequate learning model for protein secondary and local structure prediction using physicochemical properties from the AAindex database (specifically, hydrophobicity) as predictors. The prediction accuracy was significantly lower for collagen and for globular proteins containing collagen-like regions. This may be due to the high proportion of the ncAA hydroxyproline in collagen and collagen-like regions, as hydroxyproline appears to play a key role in stabilizing the structure of collagen [59]. Moreover, a quantitative relationship has been identified between collagen melting temperatures across various species and the percentage of hydroxyproline residues [60], supporting hydroxyproline’s role in stabilizing collagen-like structures, as structural stability is closely tied to the melting temperature. Despite recent advances, the accurate prediction of collagen and collagen-like conformations remains challenging [61]. We propose that one reason for these difficulties is the lack of distinction between proline and hydroxyproline in both protein sequence and physicochemical property databases. For example, the abovementioned property EISD840101 (consensus normalized hydrophobicity scale) shows an approximately fivefold difference between the measured value for proline and the predicted value for hydroxyproline (−0.07 and −0.34, respectively, with rj-n = 0.924). An even more pronounced distinction appears for the ROSM880101 (side chain hydropathy, uncorrected for solvation) [62] property, where the value for proline is −1.75, while the predicted value for hydroxyproline is 2.96, with rj-n = 0.973.
We plan to refine learning models for both local and secondary protein structure prediction by incorporating the predicted physicochemical properties of hydroxyproline and other ncAAs (using predictions with rj-n ≥ 0.6) in our future research.
4. Materials and Methods
4.1. Formulation of the Problem
To formalize the problem of predicting the physicochemical properties from the AAindex database for ncAAs, the prediction model must include information on the components shared between canonical and non-canonical amino acids. In particular, for each canonical amino acid, a feature set corresponding to its chemical composition must be generated. This set can then be correlated with a physicochemical property from the AAindex database. This defines the problem, which we solved using stepwise regression analysis.
As a feature set, we applied a set of predictors derived from the SMILES (Simplified Molecular Input Line Entry System) encoding [63,64] of each amino acid. SMILES is a string notation that describes the structure of chemical compounds using short character sequences [63,64,65]. SMILES strings can be imported by the majority of molecule editors and converted back into two-dimensional diagrams or three-dimensional molecular models. SMILES encoding is also widely used for generating features in problems related to the prediction of chemical structure and function, including applications with RDKit [66], Dragon [67], CDK2 [68], PyDescriptor [69], and others. The substitution of chemical compounds with their components has been successfully applied in machine learning, including with SMILES encoding [70,71,72,73]. SMILES can also be utilized for classification on the basis of images of chemical compounds [74].
Among the non-canonical amino acids found in the PDB [75], some contain chemical elements not present in the 20 canonical amino acids, for which the AAindex database was created. It was necessary to exclude these amino acids from further consideration.
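The exclusion step can be illustrated with a small filter over SMILES strings. This is a hedged Python sketch rather than the authors’ implementation: the function names and the regular expressions are assumptions, and the tokenization covers only bracket atoms and the SMILES organic subset. The excluded element list (As, B, Br, Cl, F, I, P, Se) is the one given in this paper; the selenium- and phosphorus-containing SMILES in the usage lines are illustrative inputs only.

```python
import re

# Elements excluded in this study because they do not occur in the
# 20 canonical amino acids.
EXCLUDED = {"As", "B", "Br", "Cl", "F", "I", "P", "Se"}

# Element symbol at the start of a bracket atom, e.g. [C@H], [Se], [nH], [13C].
BRACKET_ATOM = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")
# Atoms that may be written outside brackets (organic subset and aromatic forms).
ORGANIC_ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")
# Whole bracket atoms, removed before scanning for organic-subset atoms.
BRACKETS = re.compile(r"\[[^\]]*\]")


def elements_in_smiles(smiles: str) -> set:
    """Return the set of element symbols found in a SMILES string."""
    found = {m.group(1).capitalize() for m in BRACKET_ATOM.finditer(smiles)}
    outside = BRACKETS.sub(" ", smiles)
    found |= {m.group(0).capitalize() for m in ORGANIC_ATOM.finditer(outside)}
    return found


def is_supported(smiles: str) -> bool:
    """True if the molecule contains none of the excluded elements."""
    return elements_in_smiles(smiles).isdisjoint(EXCLUDED)


print(is_supported("O[C@H]1CN[C@@H](C1)C(O)=O"))   # HYP (Table 5)
print(is_supported("C[Se]CC[C@H](N)C(O)=O"))       # illustrative Se-containing input
print(is_supported("N[C@@H](COP(O)(O)=O)C(O)=O"))  # illustrative P-containing input
```

Matching two-letter symbols such as Cl and Br before single letters avoids misreading them as carbon or boron, which a naive character count would do.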
4.2. SMILES for Canonical Amino Acids
The learning model was constructed using all the canonical amino acids. Based on the SMILES encoding for these amino acids, we selected the features that described each amino acid in the most accurate way (using the statistical assessment described below), which were then mapped to the values of each of the 566 properties from the AAindex database. There are multiple standards for SMILES notation. Given our focus on isomeric properties, particularly the presence of chiral centers, we utilized the most recent version of the SMILES standard that incorporates these features (canonical SMILES calculated using the OpenEye OEToolkit version 1.5.0). The SMILES for all the canonical amino acids were obtained from the PDBeChem server [76] via the following URL:
4.3. SMILES for Non-Canonical Amino Acids
The complete (to September of 2024) Protein Data Bank, PDB [75], was downloaded from
Table 5 lists the 25 most frequently occurring (more than 70% of the total) ncAAs predicted using our method.
The complete occurrence table for all non-canonical amino acids can be found in Supplemental Table S3.
We selected the SMILES components shown in Table 6 to create the features of the prediction model.
The optimal set of predictors was selected by testing various candidate sets (including amino acid component frequencies, polynomial functions of amino acid component occurrences, reverse component frequencies, etc.). The best learning model in terms of the overall performance was achieved using predictors representing the frequency of the components listed in Table 6. For example, for alanine (ALA), the predictor corresponding to the component ‘C’ is 2, while the predictor corresponding to the component ‘S’ is 0.
In this way, we created a learning model where the dependent variable was a physicochemical property from the AAindex database, and the feature set was the same across all amino acids.
Stepwise regression analysis was used to create learning models describing the relationship between the input feature set and each of the 566 properties in the AAindex database. The physicochemical property FAUJ880111 (positive charge) [78] exhibited a perfect correlation with the predictor, which represents the number of positive charges. Consequently, the calculation of the statistical properties, such as the standard deviation of the regression coefficients, is not feasible and has no meaning. In the case of a perfect correlation, the standard deviation equals zero, and by definition, the Fisher statistic is calculated as the square of the ratio between the regression coefficient and the standard deviation, making the statistic undefined.
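The definition of the Fisher statistic given above can be written compactly. The symbols $b_j$ (regression coefficient of predictor $j$) and $s_{b_j}$ (its standard deviation) are notation introduced here for illustration only.

```latex
\begin{align}
F_j &= \left(\frac{b_j}{s_{b_j}}\right)^{2}, \\[4pt]
\text{EISD840101, predictor O (Table 3):}\quad
F_{\mathrm{O}} &\approx \left(\frac{-0.563688}{0.025}\right)^{2} \approx 508, \\[4pt]
\text{FAUJ880111 (perfect correlation):}\quad
s_{b_j} &= 0 \;\Rightarrow\; F_j \text{ is undefined.}
\end{align}
```

The value of about 508 obtained for the predictor O agrees with the tabulated 507.614 up to the rounding of the standard deviation to three decimal places.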
Due to the small sample size, comprising only 20 canonical amino acids, we were unable to apply neural networks to predict the physicochemical properties of ncAAs, as the neural networks trained on such a small dataset would suffer from overfitting [79]. Statistical assessments were performed to identify the significant predictors for each amino acid–physicochemical property relationship. These learning models provided templates for the prediction of the ncAAs’ properties.
4.4. The Selection of Statistically Significant Predictors and Prediction Quality Statistical Assessment
The final training dataset was relatively small, comprising only 20 standard amino acids, while the number of predictors (13) was comparable to the number of amino acids. Learning models created under such conditions can recognize the existing data satisfactorily but perform moderately when predicting unknown data. To assess the prediction quality, we generated auxiliary models where 1 amino acid was removed from the dataset: based on the subset of 19 canonical amino acids, the given physicochemical property of the 20th amino acid was predicted. The selection of significant predictors for prediction was determined by the value of the Fisher statistic: the predictors were considered significant if, when included in the learning model, the Fisher statistic value for each predictor within the model exceeded the current threshold. By performing this procedure for all 20 canonical amino acids, we obtained the predicted values for the given physicochemical property and calculated the correlation coefficient between the predicted and actual values. The threshold for the F-statistic (F-value) was determined via leave-one-out cross validation. The choice of the optimal F-value for the physicochemical property EISD840101 (consensus normalized hydrophobicity scale) [54] is shown in Table 7.
This procedure was repeated for all the physicochemical properties from the AAindex database.
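The leave-one-out selection of the F-value threshold can be sketched as follows. This is an illustrative Python example under stated assumptions, not the authors’ C++ code: the stepwise regression is approximated by backward elimination (the exact stepwise variant is not specified here), the data are synthetic toy values rather than AAindex properties, and all function names are hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Least squares with intercept; returns coefficients and F = (b / SD)^2."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = len(y) - A.shape[1]
    cov = (resid @ resid / dof) * np.linalg.pinv(A.T @ A)
    f_values = (beta / np.sqrt(np.diag(cov))) ** 2
    return beta, f_values


def select_predictors(X, y, f_threshold):
    """Backward elimination: drop the weakest predictor until all F > threshold."""
    keep = list(range(X.shape[1]))
    while keep:
        _, f_values = fit_ols(X[:, keep], y)
        f_pred = f_values[1:]  # skip the intercept
        worst = int(np.argmin(f_pred))
        if f_pred[worst] > f_threshold:
            break
        keep.pop(worst)
    return keep


def loo_correlation(X, y, f_threshold):
    """Pearson correlation between actual values and leave-one-out predictions."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i  # hold out sample i
        keep = select_predictors(X[train], y[train], f_threshold)
        if not keep:
            preds[i] = y[train].mean()
            continue
        beta, _ = fit_ols(X[train][:, keep], y[train])
        preds[i] = beta[0] + X[i, keep] @ beta[1:]
    return np.corrcoef(y, preds)[0, 1]


# Synthetic toy data: 20 "amino acids", 4 count-like predictors (NOT real data).
rng = np.random.default_rng(0)
X = rng.integers(0, 6, size=(20, 4)).astype(float)
y = 1.6 - 0.5 * X[:, 0] + 0.1 * X[:, 2] + rng.normal(0.0, 0.05, size=20)

thresholds = [1.0, 2.0, 2.4, 3.0]
scores = {t: loo_correlation(X, y, t) for t in thresholds}
best = max(scores, key=scores.get)
print(f"best F-value threshold: {best}, leave-one-out r = {scores[best]:.3f}")
```

Predictor selection is repeated inside every leave-one-out fold, so the held-out amino acid never influences which predictors are retained for its own prediction.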
5. Conclusions
We suggest that the obtained results will be of significant interest for the detailed prediction and analysis of the structure and function of both native and synthetic proteins containing ncAAs. Furthermore, the AAindex physicochemical properties are especially valuable for studies involving small proteins and protein ligands that incorporate ncAAs.
The method that we developed is general and is limited only by the absence of experimental data for ncAAs containing chemical elements that do not occur in canonical amino acids (As, B, Br, Cl, F, I, P, and Se, at the time of publication). As experimental data for these physicochemical properties are obtained, the learning models can be retrained to include ncAAs with these chemical elements, and the set of statistically significant predictors and the database can be expanded accordingly. We plan to update both the database and learning models as such experimental data become available.
The developed tools (including the source code) and the database are freely accessible at
The relevance and importance of our work is derived from the increasing acknowledgement of the potential of non-canonical amino acids in enzymology, biocatalysis, and biological therapeutics. We anticipate that our results and the software we developed will facilitate useful theoretical predictions, thereby serving as a foundation for screening studies in these fields.
Conceptualization, Y.V.M., G.I.K. and Y.V.K.; methodology, Y.V.M., G.I.K. and Y.V.K.; software, Y.V.M. and Y.V.K.; validation, Y.V.M.; statistical assessments: Y.V.M.; writing—original draft preparation, Y.V.M. and Y.V.K.; writing—review and editing, Y.V.M., G.I.K. and Y.V.K.; website, Y.V.K. All authors have read and agreed to the published version of the manuscript.
The source code for all scripts used in this work, the final learning model, and the precomputed database of the physicochemical properties of non-canonical amino acids are available at
We thank the Centre for Precision Genome Editing and Genetic Technologies for Biomedicine for access to the computing resources used to conduct this study.
The authors declare no conflicts of interest.
Table 1. The 20 best-predicted physicochemical properties from the AAindex database.
AAindex Accession | rj-n | RMSE | F-Value Threshold | Pnum |
---|---|---|---|---|
CHAM820101 | 0.999 | 0.005 | 1.2 | 10 |
KARS160117 | 0.994 | 1.820 | 2.0 | 8 |
FAUJ880103 | 0.989 | 0.287 | 1.1 | 10 |
LEVM760105 | 0.989 | 0.070 | 2.1 | 6 |
BIGC670101 | 0.986 | 4.580 | 1.0 | 9 |
GOLD730102 | 0.985 | 6.860 | 6.9 | 5 |
KARS160107 | 0.982 | 0.748 | 4.0 | 6 |
CHOC750101 | 0.976 | 9.540 | 1.5 | 9 |
FASG760101 | 0.975 | 6.920 | 1.4 | 10 |
KARS160101 | 0.975 | 0.526 | 3.4 | 6 |
ROSM880101 | 0.973 | 1.230 | 1.7 | 9 |
TSAJ990101 | 0.970 | 9.920 | 2.3 | 7 |
LEVM760102 | 0.966 | 0.255 | 3.3 | 6 |
TSAJ990102 | 0.966 | 10.600 | 3.2 | 7 |
KRIW790103 | 0.963 | 9.900 | 4.6 | 5 |
KUHL950101 | 0.958 | 0.095 | 3.3 | 3 |
ZIMJ680104 | 0.958 | 0.850 | 1.0 | 7 |
CHOC760101 | 0.957 | 12.500 | 1.0 | 7 |
KARS160114 | 0.954 | 2.210 | 1.5 | 6 |
KARS160102 | 0.952 | 0.740 | 3.4 | 4 |
The complete list of physicochemical properties from the AAindex database is available at
Table 2. Prediction quality of hydrophobicity-related physicochemical properties.
AAindex Accession | rj-n | Property Explanation |
---|---|---|
KUHL950101 | 0.958 | Hydrophilicity scale |
WOLR790101 | 0.954 | Hydrophobicity index |
EISD840101 | 0.924 | Consensus normalized hydrophobicity scale |
KIDA850101 | 0.903 | Hydrophobicity-related index |
PRAM900101 | 0.888 | Hydrophobicity |
ENGD860101 | 0.887 | Hydrophobicity index |
BLAS910101 | 0.873 | Scaled side chain hydrophobicity values |
GOLD730101 | 0.828 | Hydrophobicity factor |
COWR900101 | 0.808 | Hydrophobicity index, 3.0 pH |
CIDH920102 | 0.802 | Normalized hydrophobicity scales for beta-proteins |
JURD980101 | 0.774 | Modified Kyte–Doolittle hydrophobicity scale |
WILM950101 | 0.714 | Hydrophobicity coefficient in RP-HPLC, C18 with |
CIDH920105 | 0.697 | Normalized average hydrophobicity scales |
ARGP820101 | 0.659 | Hydrophobicity index |
JOND750101 | 0.646 | Hydrophobicity |
CIDH920104 | 0.625 | Normalized hydrophobicity scales for alpha-/beta-proteins |
CASG920101 | 0.600 | Hydrophobicity scale from native protein structures |
PONP800106 | 0.589 | Surrounding hydrophobicity in turn |
CIDH920103 | 0.563 | Normalized hydrophobicity scales for alpha+beta-proteins |
SWER830101 | 0.563 | Optimal matching hydrophobicity |
PONP800103 | 0.545 | Average gain ratio in surrounding hydrophobicity |
CIDH920101 | 0.537 | Normalized hydrophobicity scales for alpha-proteins |
MANP780101 | 0.512 | Average surrounding hydrophobicity |
PONP930101 | 0.501 | Hydrophobicity scales |
FASG890101 | 0.479 | Hydrophobicity index |
PONP800105 | 0.476 | Surrounding hydrophobicity in beta-sheet |
PONP800102 | 0.458 | Average gain in surrounding hydrophobicity |
ZIMJ680101 | 0.455 | Hydrophobicity |
PONP800101 | 0.445 | Surrounding hydrophobicity in folded form |
WILM950102 | 0.427 | Hydrophobicity coefficient in RP-HPLC, C8 with |
WILM950103 | 0.368 | Hydrophobicity coefficient in RP-HPLC, C4 with |
PONP800104 | 0.242 | Surrounding hydrophobicity in alpha-helix |
WILM950104 | −0.193 | Hydrophobicity coefficient in RP-HPLC, C18 with |
The complete list of physicochemical properties from the AAindex database is available at
Table 3. Learning model of AAindex property EISD840101.
Predictor | Regression Coefficient | SD of RC | F-Value |
---|---|---|---|
O | −0.563688 | 0.025 | 507.614 |
N | −0.556289 | 0.050 | 120.263 |
n | 0.163592 | 0.054 | 9.014 |
=N | 0.452579 | 0.135 | 11.118 |
C | 0.113070 | 0.015 | 56.734 |
c | 0.148830 | 0.014 | 111.808 |
S | −0.248443 | 0.052 | 22.228 |
c1 | −0.290756 | 0.035 | 68.744 |
c2 | −0.688098 | 0.102 | 44.738 |
+ | −1.168294 | 0.084 | 189.686 |
Constant term | 1.630 | | |
F-value: the value of the F-statistic used as the threshold for a predictor to be included in the learning model; regression coefficient values were obtained through the stepwise regression procedure; SD of RC: the standard deviation of the regression coefficient across the 20 canonical amino acids.
Table 4. The calculation of the EISD840101 property value for the 4-HYDROXYPROLINE ncAA.
Component | Predictor’s Value | Regression Coefficient |
---|---|---|
=O | 1 | 0.000000 |
O | 3 | −0.563688 |
N | 1 | −0.556289 |
n | 0 | 0.163592 |
=N | 0 | 0.452579 |
C | 5 | 0.113070 |
c | 0 | 0.148830 |
[C@ | 2 | 0.000000 |
S | 0 | −0.248443 |
C1 | 1 | −0.290756 |
C2 | 0 | −0.688098 |
= | 1 | 0.000000 |
+ | 0 | −1.168294 |
Constant term | | 1.629829 |
Table 5. The 25 most frequently occurring ncAAs in the PDB, predicted using the suggested method.
Code | Amino Acid | SMILES | Number | Percent |
---|---|---|---|---|
MLY | N-DIMETHYL-LYSINE | CN(C)CCCC[C@H](N)C(O)=O | 5324 | 20.151 |
HYP | 4-HYDROXYPROLINE | O[C@H]1CN[C@@H](C1)C(O)=O | 2264 | 8.569 |
NAG | 2-ACETAMIDO-2-DEOXY-BETA-D-GLUCOPYRANOSE | CC(=O)N[C@H]1[C@H](O)O[C@H](CO)[C@@H](O)[C@@H]1O | 983 | 3.721 |
CSO | S-HYDROXYCYSTEINE | N[C@@H](CSO)C(O)=O | 890 | 3.369 |
CRO | {2-[(1R,2R)-1-AMINO-2-HYDROXYPROPYL]-4-(4-HYDROXYBENZYLIDENE)-5-OXO-4,5-DIHYDRO-1H-IMIDAZOL-1-YL}ACETIC ACID | C[C@@H](O)[C@H](N)C1=N\C(=C/c2ccc(O)cc2)C(=O)N1CC(O)=O | 886 | 3.354 |
KCX | LYSINE NZ-CARBOXYLIC ACID | N[C@@H](CCCCNC(O)=O)C(O)=O | 877 | 3.319 |
PCA | PYROGLUTAMIC ACID | OC(=O)[C@@H]1CCC(=O)N1 | 783 | 2.964 |
CME | S,S-(2-HYDROXYETHYL)THIOCYSTEINE | N[C@@H](CSSCCO)C(O)=O | 679 | 2.570 |
CSD | 3-SULFINOALANINE | N[C@@H](C[S](O)=O)C(O)=O | 617 | 2.335 |
NRQ | {(4Z)-4-(4-HYDROXYBENZYLIDENE)-2-[3-(METHYLTHIO)PROPANIMIDOYL]-5-OXO-4,5-DIHYDRO-1H-IMIDAZOL-1-YL}ACETIC ACID | CSCCC(=N)C1=N\C(=C/c2ccc(O)cc2)C(=O)N1CC(O)=O | 604 | 2.286 |
CR2 | {(4Z)-2-(AMINOMETHYL)-4-[(4-HYDROXYPHENYL)METHYLIDENE]-5-OXO-4,5-DIHYDRO-1H-IMIDAZOL-1-YL}ACETIC ACID | NCC1=N\C(=C/c2ccc(O)cc2)C(=O)N1CC(O)=O | 521 | 1.972 |
CGU | GAMMA-CARBOXY-GLUTAMIC ACID | N[C@@H](CC(C(O)=O)C(O)=O)C(O)=O | 507 | 1.919 |
GYC | [(4Z)-2-[(1R)-1-AMINO-2-MERCAPTOETHYL]-4-(4-HYDROXYBENZYLIDENE)-5-OXO-4,5-DIHYDRO-1H-IMIDAZOL-1-YL]ACETIC ACID | N[C@@H](CS)C1=N\C(=C/c2ccc(O)cc2)C(=O)N1CC(O)=O | 396 | 1.499 |
CRQ | [2-(3-CARBAMOYL-1-IMINO-PROPYL)-4-(4-HYDROXY-BENZYLIDENE)-5-OXO-4,5-DIHYDRO-IMIDAZOL-1-YL]-ACETIC ACID | NC(=O)CCC(=N)C1=N\C(=C/c2ccc(O)cc2)C(=O)N1CC(O)=O | 381 | 1.442 |
MDO | {2-[(1S)-1-AMINOETHYL]-4-METHYLIDENE-5-OXO-4,5-DIHYDRO-1H-IMIDAZOL-1-YL}ACETIC ACID | C[C@H](N)C1=NC(=C)C(=O)N1CC(O)=O | 356 | 1.347 |
OCS | CYSTEINESULFONIC ACID | N[C@@H](C[S](O)(=O)=O)C(O)=O | 352 | 1.332 |
FME | N-FORMYLMETHIONINE | CSCC[C@H](NC=O)C(O)=O | 300 | 1.136 |
ORN | L-ORNITHINE | NCCC[C@H](N)C(O)=O | 299 | 1.132 |
ABA | ALPHA-AMINOBUTYRIC ACID | CC[C@H](N)C(O)=O | 284 | 1.075 |
CR8 | 2-[1-AMINO-2-(1H-IMIDAZOL-5-YL)ETHYL]-1-(CARBOXYMETHYL)-4-[(4-OXOCYCLOHEXA-2,5-DIEN-1-YLIDENE)METHYL]-1H-IMIDAZOL-5-OLATE | N[C@@H](Cc1[nH]cnc1)c2nc(C=C3C=CC(=O)C=C3)c([O-])n2CC(O)=O | 279 | 1.056 |
TYS | O-SULFO-L-TYROSINE | N[C@@H](Cc1ccc(O[S](O)(=O)=O)cc1)C(O)=O | 276 | 1.045 |
SMC | S-METHYLCYSTEINE | CSC[C@H](N)C(O)=O | 258 | 0.977 |
M3L | N-TRIMETHYLLYSINE | C[N+](C)(C)CCCC[C@H](N)C(O)=O | 254 | 0.961 |
ALY | N(6)-ACETYLLYSINE | CC(=O)NCCCC[C@H](N)C(O)=O | 239 | 0.905 |
GYS | [(4Z)-2-(1-AMINO-2-HYDROXYETHYL)-4-(4-HYDROXYBENZYLIDENE)-5-OXO-4,5-DIHYDRO-1H-IMIDAZOL-1-YL]ACETIC ACID | N[C@@H](CO)C1=N\C(=C/c2ccc(O)cc2)C(=O)N1CC(O)=O | 225 | 0.852 |
The image and a detailed description of each ncAA can be found in the PDBeChem databank at
Table 6. SMILES components selected to create features.
Component | Description |
---|---|
=O | oxygen, forming a double bond |
O | any oxygen |
N | nitrogen, except an aromatic ring |
n | nitrogen in an aromatic ring |
=N | nitrogen, forming a double bond |
C | carbon, except an aromatic ring |
c | carbon in an aromatic ring |
[C@ | carbon as a chiral center |
S | sulfur |
c1 | any ring (aromatic or any other cycle) |
c2 | second ring (aromatic or any other cycle) |
= | any double bond |
+ | positive charge |
The selection of the optimal F-statistic value for the EISD840101 AAindex property.
F-Value | r | rj-n | Pnum |
---|---|---|---|
1 | 0.997 | 0.850 | 11 |
2 | 0.997 | 0.848 | 10 |
2.4 | 0.953 | 0.924 | 10 |
3 | 0.953 | 0.890 | 3 |
r: Pearson’s correlation coefficient between the experimental and predicted values for all 20 canonical amino acids included in the learning model; rj-n: Pearson’s correlation calculated using the leave-one-out cross validation approach; Pnum: the number of statistically significant predictors included in the learning model. No other F-values produced higher or equivalent rj-n values. Thus, the highest correlation coefficient, rj-n = 0.924, corresponds to F-value = 2.4 and to the 10 relevant statistically significant predictors used to predict the EISD840101 AAindex property.
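
The leave-one-out correlation and the choice of F-value described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it fits a single-predictor least-squares line instead of the multi-predictor model, the helper names (`pearson`, `fit_line`, `loo_pearson`) are hypothetical, and `toy_x`/`toy_y` are made-up numbers used only to exercise the function. The rj-n values in `r_loo_by_f` are copied from the table.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loo_pearson(xs, ys):
    """Predict each point from a model trained on all the others, then
    correlate the held-out predictions with the observed values."""
    preds = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(slope * xs[i] + intercept)
    return pearson(ys, preds)


# Toy data (invented for illustration only, not from the paper)
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toy_y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
print(round(loo_pearson(toy_x, toy_y), 3))

# rj-n values from the table, keyed by F-value; choose the F-value with the
# highest leave-one-out correlation
r_loo_by_f = {1.0: 0.850, 2.0: 0.848, 2.4: 0.924, 3.0: 0.890}
best_f = max(r_loo_by_f, key=r_loo_by_f.get)
print(best_f)
```

Selecting on rj-n rather than r guards against overfitting: the two lowest F-values give the highest in-sample r (0.997) but lower leave-one-out correlations than F-value = 2.4.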
Supplementary Materials
The supporting information can be downloaded at
References
1. Kawashima, S.; Pokarowski, P.; Pokarowska, M.; Kolinski, A.; Katayama, T.; Kanehisa, M. AAindex: Amino acid index database, progress report 2008. Nucleic Acids Res.; 2008; 36, pp. D202-D205. [DOI: https://dx.doi.org/10.1093/nar/gkm998] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/17998252]
2. Rodrigues, C.H.M.; Myung, Y.; Pires, D.E.V.; Ascher, D.B. mCSM-PPI2: Predicting the effects of mutations on protein-protein interactions. Nucleic Acids Res.; 2019; 47, pp. W338-W344. [DOI: https://dx.doi.org/10.1093/nar/gkz383] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31114883]
3. Wang, H.; Liu, C.; Deng, L. Enhanced Prediction of Hot Spots at Protein-Protein Interfaces Using Extreme Gradient Boosting. Sci. Rep.; 2018; 8, 14285. [DOI: https://dx.doi.org/10.1038/s41598-018-32511-1] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30250210]
4. Rozhonova, H.; Marti-Gomez, C.; McCandlish, D.M.; Payne, J.L. Robust genetic codes enhance protein evolvability. PLoS Biol.; 2024; 22, e3002594. [DOI: https://dx.doi.org/10.1371/journal.pbio.3002594] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38754362]
5. Schmitt, A.; Schuchhardt, J.; Brockmann, G.A. The action of key factors in protein evolution at high temporal resolution. PLoS ONE; 2009; 4, e4821. [DOI: https://dx.doi.org/10.1371/journal.pone.0004821]
6. Yampolsky, L.Y.; Bouzinier, M.A. Evolutionary patterns of amino acid substitutions in 12 Drosophila genomes. BMC Genom.; 2010; 11, (Suppl. S4), S10. [DOI: https://dx.doi.org/10.1186/1471-2164-11-S4-S10]
7. Bohorquez, H.J.; Suarez, C.F.; Patarroyo, M.E. Mass & secondary structure propensity of amino acids explain their mutability and evolutionary replacements. Sci. Rep.; 2017; 7, 7717. [DOI: https://dx.doi.org/10.1038/s41598-017-08041-7]
8. Rimal, P.; Panday, S.K.; Xu, W.; Peng, Y.; Alexov, E. SAAMBE-MEM: A sequence-based method for predicting binding free energy change upon mutation in membrane protein-protein complexes. Bioinformatics; 2024; 40, btae544. [DOI: https://dx.doi.org/10.1093/bioinformatics/btae544]
9. Kuang, J.; Zhao, Z.; Yang, Y.; Yan, W. PON-Tm: A Sequence-Based Method for Prediction of Missense Mutation Effects on Protein Thermal Stability Changes. Int. J. Mol. Sci.; 2024; 25, 8379. [DOI: https://dx.doi.org/10.3390/ijms25158379]
10. Aljarf, R.; Shen, M.; Pires, D.E.V.; Ascher, D.B. Understanding and predicting the functional consequences of missense mutations in BRCA1 and BRCA2. Sci. Rep.; 2022; 12, 10458. [DOI: https://dx.doi.org/10.1038/s41598-022-13508-3]
11. Nishi, H.; Tyagi, M.; Teng, S.; Shoemaker, B.A.; Hashimoto, K.; Alexov, E.; Wuchty, S.; Panchenko, A.R. Cancer missense mutations alter binding properties of proteins and their interaction networks. PLoS ONE; 2013; 8, e66273. [DOI: https://dx.doi.org/10.1371/journal.pone.0066273] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/23799087]
12. Livesey, B.J.; Marsh, J.A. The properties of human disease mutations at protein interfaces. PLoS Comput. Biol.; 2022; 18, e1009858. [DOI: https://dx.doi.org/10.1371/journal.pcbi.1009858] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35120134]
13. Sekiyama, N.; Takaba, K.; Maki-Yonekura, S.; Akagi, K.I.; Ohtani, Y.; Imamura, K.; Terakawa, T.; Yamashita, K.; Inaoka, D.; Yonekura, K. et al. ALS mutations in the TIA-1 prion-like domain trigger highly condensed pathogenic structures. Proc. Natl. Acad. Sci. USA; 2022; 119, e2122523119. [DOI: https://dx.doi.org/10.1073/pnas.2122523119]
14. Ruiz-Blanco, Y.B.; Aguero-Chapin, G.; Garcia-Hernandez, E.; Alvarez, O.; Antunes, A.; Green, J. Exploring general-purpose protein features for distinguishing enzymes and non-enzymes within the twilight zone. BMC Bioinform.; 2017; 18, 349. [DOI: https://dx.doi.org/10.1186/s12859-017-1758-x]
15. Vanella, R.; Kovacevic, G.; Doffini, V.; Fernandez de Santaella, J.; Nash, M.A. High-throughput screening, next generation sequencing and machine learning: Advanced methods in enzyme engineering. Chem. Commun.; 2022; 58, pp. 2455-2467. [DOI: https://dx.doi.org/10.1039/D1CC04635G]
16. Ramirez-Palacios, C.; Marrink, S.J. Super High-Throughput Screening of Enzyme Variants by Spectral Graph Convolutional Neural Networks. J. Chem. Theory Comput.; 2023; 19, pp. 4668-4677. [DOI: https://dx.doi.org/10.1021/acs.jctc.2c01227]
17. Li, G.; Jia, L.; Wang, K.; Sun, T.; Huang, J. Prediction of Thermostability of Enzymes Based on the Amino Acid Index (AAindex) Database and Machine Learning. Molecules; 2023; 28, 8097. [DOI: https://dx.doi.org/10.3390/molecules28248097]
18. Kim, H.; Kihara, D. Protein structure prediction using residue- and fragment-environment potentials in CASP11. Proteins; 2016; 84, (Suppl. S1), pp. 105-117. [DOI: https://dx.doi.org/10.1002/prot.24920]
19. Kloczkowski, A.; Jernigan, R.L.; Wu, Z.; Song, G.; Yang, L.; Kolinski, A.; Pokarowski, P. Distance matrix-based approach to protein structure prediction. J. Struct. Funct. Genom.; 2009; 10, pp. 67-81. [DOI: https://dx.doi.org/10.1007/s10969-009-9062-2]
20. Ren, J.; Liu, Q.; Ellis, J.; Li, J. Tertiary structure-based prediction of conformational B-cell epitopes through B factors. Bioinformatics; 2014; 30, pp. i264-i273. [DOI: https://dx.doi.org/10.1093/bioinformatics/btu281]
21. Milchevskiy, Y.V.; Milchevskaya, V.Y.; Nikitin, A.M.; Kravatsky, Y.V. Effective Local and Secondary Protein Structure Prediction by Combining a Neural Network-Based Approach with Extensive Feature Design and Selection without Reliance on Evolutionary Information. Int. J. Mol. Sci.; 2023; 24, 15656. [DOI: https://dx.doi.org/10.3390/ijms242115656] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37958639]
22. Dong, B.; Liu, Z.; Xu, D.; Hou, C.; Dong, G.; Zhang, T.; Wang, G. SERT-StructNet: Protein secondary structure prediction method based on multi-factor hybrid deep model. Comput. Struct. Biotechnol. J.; 2024; 23, pp. 1364-1375. [DOI: https://dx.doi.org/10.1016/j.csbj.2024.03.018] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38596312]
23. Dong, B.; Liu, Z.; Xu, D.; Hou, C.; Niu, N.; Wang, G. Impact of Multi-Factor Features on Protein Secondary Structure Prediction. Biomolecules; 2024; 14, 1155. [DOI: https://dx.doi.org/10.3390/biom14091155] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/39334921]
24. Vishnepolsky, B.; Grigolava, M.; Gabrielian, A.; Rosenthal, A.; Hurt, D.; Tartakovsky, M.; Pirtskhalava, M. Analysis, Modeling, and Target-Specific Predictions of Linear Peptides Inhibiting Virus Entry. ACS Omega; 2023; 8, pp. 46218-46226. [DOI: https://dx.doi.org/10.1021/acsomega.3c07521]
25. Nath, A. Physicochemical and sequence determinants of antiviral peptides. Biol. Futur.; 2023; 74, pp. 489-506. [DOI: https://dx.doi.org/10.1007/s42977-023-00188-x]
26. Codina, J.R.; Mascini, M.; Dikici, E.; Deo, S.K.; Daunert, S. Accelerating the Screening of Small Peptide Ligands by Combining Peptide-Protein Docking and Machine Learning. Int. J. Mol. Sci.; 2023; 24, 12144. [DOI: https://dx.doi.org/10.3390/ijms241512144]
27. Han, J.; Kong, T.; Liu, J. PepNet: An interpretable neural network for anti-inflammatory and antimicrobial peptides prediction using a pre-trained protein language model. Commun. Biol.; 2024; 7, 1198. [DOI: https://dx.doi.org/10.1038/s42003-024-06911-1]
28. Ong, S.A.; Lin, H.H.; Chen, Y.Z.; Li, Z.R.; Cao, Z. Efficacy of different protein descriptors in predicting protein functional families. BMC Bioinform.; 2007; 8, 300. [DOI: https://dx.doi.org/10.1186/1471-2105-8-300]
29. Hecht, M.; Bromberg, Y.; Rost, B. Better prediction of functional effects for sequence variants. BMC Genom.; 2015; 16, (Suppl. S8), S1. [DOI: https://dx.doi.org/10.1186/1471-2164-16-S8-S1]
30. Xu, J.; Li, F.; Li, C.; Guo, X.; Landersdorfer, C.; Shen, H.H.; Peleg, A.Y.; Li, J.; Imoto, S.; Yao, J. et al. iAMPCN: A deep-learning approach for identifying antimicrobial peptides and their functional activities. Brief. Bioinform.; 2023; 24, bbad240. [DOI: https://dx.doi.org/10.1093/bib/bbad240]
31. Nordquist, E.; Zhang, G.; Barethiya, S.; Ji, N.; White, K.M.; Han, L.; Jia, Z.; Shi, J.; Cui, J.; Chen, J. Incorporating physics to overcome data scarcity in predictive modeling of protein function: A case study of BK channels. PLoS Comput. Biol.; 2023; 19, e1011460. [DOI: https://dx.doi.org/10.1371/journal.pcbi.1011460] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37713443]
32. Collingridge, P.W.; Kelly, S. MergeAlign: Improving multiple sequence alignment performance by dynamic reconstruction of consensus multiple sequence alignments. BMC Bioinform.; 2012; 13, 117. [DOI: https://dx.doi.org/10.1186/1471-2105-13-117] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/22646090]
33. Liu, B.; Xu, J.; Zou, Q.; Xu, R.; Wang, X.; Chen, Q. Using distances between Top-n-gram and residue pairs for protein remote homology detection. BMC Bioinform.; 2014; 15, (Suppl. S2), S3. [DOI: https://dx.doi.org/10.1186/1471-2105-15-S2-S3]
34. Koehl, P.; Orland, H.; Delarue, M. Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments. Molecules; 2018; 24, 104. [DOI: https://dx.doi.org/10.3390/molecules24010104] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30597916]
35. Hollebrands, B.; Hageman, J.A.; van de Sande, J.W.; Albada, B.; Janssen, H.G. Improved LC-MS identification of short homologous peptides using sequence-specific retention time predictors. Anal. Bioanal. Chem.; 2023; 415, pp. 2715-2726. [DOI: https://dx.doi.org/10.1007/s00216-023-04670-2]
36. Sultan, M.F.; Shaon, M.S.H.; Karim, T.; Ali, M.M.; Hasan, M.Z.; Ahmed, K.; Bui, F.M.; Chen, L.; Dhasarathan, V.; Moni, M.A. MLAFP-XN: Leveraging neural network model for development of antifungal peptide identification tool. Heliyon; 2024; 10, e37820. [DOI: https://dx.doi.org/10.1016/j.heliyon.2024.e37820]
37. Yao, L.; Xie, P.; Guan, J.; Chung, C.R.; Zhang, W.; Deng, J.; Huang, Y.; Chiang, Y.C.; Lee, T.Y. ACP-CapsPred: An explainable computational framework for identification and functional prediction of anticancer peptides based on capsule network. Brief. Bioinform.; 2024; 25, bbae460. [DOI: https://dx.doi.org/10.1093/bib/bbae460]
38. Liang, X.; Zhao, H.; Wang, J. MA-PEP: A novel anticancer peptide prediction framework with multimodal feature fusion based on attention mechanism. Protein Sci.; 2024; 33, e4966. [DOI: https://dx.doi.org/10.1002/pro.4966]
39. Sun, S.; Yang, X.; Wang, Y.; Shen, X. In Vivo Analysis of Protein-Protein Interactions with Bioluminescence Resonance Energy Transfer (BRET): Progress and Prospects. Int. J. Mol. Sci.; 2016; 17, 1704. [DOI: https://dx.doi.org/10.3390/ijms17101704]
40. Vickers, T.A.; Crooke, S.T. Development of a Quantitative BRET Affinity Assay for Nucleic Acid-Protein Interactions. PLoS ONE; 2016; 11, e0161930. [DOI: https://dx.doi.org/10.1371/journal.pone.0161930]
41. Lostao, A.; Lim, K.; Pallares, M.C.; Ptak, A.; Marcuello, C. Recent advances in sensing the inter-biomolecular interactions at the nanoscale—A comprehensive review of AFM-based force spectroscopy. Int. J. Biol. Macromol.; 2023; 238, 124089. [DOI: https://dx.doi.org/10.1016/j.ijbiomac.2023.124089] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36948336]
42. Katoh, T.; Sengoku, T.; Hirata, K.; Ogata, K.; Suga, H. Ribosomal synthesis and de novo discovery of bioactive foldamer peptides containing cyclic beta-amino acids. Nat. Chem.; 2020; 12, pp. 1081-1088. [DOI: https://dx.doi.org/10.1038/s41557-020-0525-1] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32839601]
43. Adaligil, E.; Song, A.; Cunningham, C.N.; Fairbrother, W.J. Ribosomal Synthesis of Macrocyclic Peptides with Linear gamma(4)- and beta-Hydroxy-gamma(4)-amino Acids. ACS Chem. Biol.; 2021; 16, pp. 1325-1331. [DOI: https://dx.doi.org/10.1021/acschembio.1c00292] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34270222]
44. Goettig, P.; Koch, N.G.; Budisa, N. Non-Canonical Amino Acids in Analyses of Protease Structure and Function. Int. J. Mol. Sci.; 2023; 24, 14035. [DOI: https://dx.doi.org/10.3390/ijms241814035]
45. Fuertes, G.; Sakamoto, K.; Budisa, N. Editorial: Exploring and expanding the protein universe with non-canonical amino acids. Front. Mol. Biosci.; 2023; 10, 1303286. [DOI: https://dx.doi.org/10.3389/fmolb.2023.1303286]
46. Castro, T.G.; Melle-Franco, M.; Sousa, C.E.A.; Cavaco-Paulo, A.; Marcos, J.C. Non-Canonical Amino Acids as Building Blocks for Peptidomimetics: Structure, Function, and Applications. Biomolecules; 2023; 13, 981. [DOI: https://dx.doi.org/10.3390/biom13060981]
47. Lugtenburg, T.; Gran-Scheuch, A.; Drienovska, I. Non-canonical amino acids as a tool for the thermal stabilization of enzymes. Protein Eng. Des. Sel.; 2023; 36, gzad003. [DOI: https://dx.doi.org/10.1093/protein/gzad003]
48. Pham, P.N.; Zahradnik, J.; Kolarova, L.; Schneider, B.; Fuertes, G. Regulation of IL-24/IL-20R2 complex formation using photocaged tyrosines and UV light. Front. Mol. Biosci.; 2023; 10, 1214235. [DOI: https://dx.doi.org/10.3389/fmolb.2023.1214235]
49. Khoury, G.A.; Smadbeck, J.; Tamamis, P.; Vandris, A.C.; Kieslich, C.A.; Floudas, C.A. Forcefield_NCAA: Ab initio charge parameters to aid in the discovery and design of therapeutic proteins and peptides with unnatural amino acids and their application to complement inhibitors of the compstatin family. ACS Synth. Biol.; 2014; 3, pp. 855-869. [DOI: https://dx.doi.org/10.1021/sb400168u]
50. Croitoru, A.; Park, S.J.; Kumar, A.; Lee, J.; Im, W.; MacKerell, A.D., Jr.; Aleksandrov, A. Additive CHARMM36 Force Field for Nonstandard Amino Acids. J. Chem. Theory Comput.; 2021; 17, pp. 3554-3570. [DOI: https://dx.doi.org/10.1021/acs.jctc.1c00254]
51. Renfrew, P.D.; Choi, E.J.; Bonneau, R.; Kuhlman, B. Incorporation of noncanonical amino acids into Rosetta and use in computational protein-peptide interface design. PLoS ONE; 2012; 7, e32637. [DOI: https://dx.doi.org/10.1371/journal.pone.0032637] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/22431978]
52. Hickey, J.L.; Sindhikara, D.; Zultanski, S.L.; Schultz, D.M. Beyond 20 in the 21st Century: Prospects and Challenges of Non-canonical Amino Acids in Peptide Drug Discovery. ACS Med. Chem. Lett.; 2023; 14, pp. 557-565. [DOI: https://dx.doi.org/10.1021/acsmedchemlett.3c00037] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37197469]
53. Zhang, H.; Zheng, Z.; Dong, L.; Shi, N.; Yang, Y.; Chen, H.; Shen, Y.; Xia, Q. Rational incorporation of any unnatural amino acid into proteins by machine learning on existing experimental proofs. Comput. Struct. Biotechnol. J.; 2022; 20, pp. 4930-4941. [DOI: https://dx.doi.org/10.1016/j.csbj.2022.08.063] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36147660]
54. Eisenberg, D. Three-dimensional structure of membrane and surface proteins. Annu. Rev. Biochem.; 1984; 53, pp. 595-623. [DOI: https://dx.doi.org/10.1146/annurev.bi.53.070184.003115]
55. Fasman, G.D. Handbook of Biochemistry and Molecular Biology; 3rd ed. Fasman, G.D. CRC Press: Cleveland, OH, USA, 1976; Volume 1.
56. Charton, M.; Charton, B.I. The dependence of the Chou-Fasman parameters on amino acid side chain structure. J. Theor. Biol.; 1983; 102, pp. 121-134. [DOI: https://dx.doi.org/10.1016/0022-5193(83)90265-5]
57. Dayhoff, M.O. Atlas of protein sequence and structure. National Biomedical Research Foundation; National Geodetic Survey, NOAA: Silver Spring, MD, USA, 1972; Volume 5, 5.
58. Chou, P.Y.; Fasman, G.D. Empirical predictions of protein conformation. Annu. Rev. Biochem.; 1978; 47, pp. 251-276. [DOI: https://dx.doi.org/10.1146/annurev.bi.47.070178.001343]
59. Bella, J.; Eaton, M.; Brodsky, B.; Berman, H.M. Crystal and molecular structure of a collagen-like peptide at 1.9 Å resolution. Science; 1994; 266, pp. 75-81. [DOI: https://dx.doi.org/10.1126/science.7695699]
60. Burjanadze, T.V. Hydroxyproline content and location in relation to collagen thermal stability. Biopolymers; 1979; 18, pp. 931-938. [DOI: https://dx.doi.org/10.1002/bip.1979.360180413]
61. O’Brien, K.T.; Mooney, C.; Lopez, C.; Pollastri, G.; Shields, D.C. Prediction of polyproline II secondary structure propensity in proteins. R. Soc. Open Sci.; 2020; 7, 191239. [DOI: https://dx.doi.org/10.1098/rsos.191239]
62. Roseman, M.A. Hydrophilicity of polar amino acid side-chains is markedly reduced by flanking peptide bonds. J. Mol. Biol.; 1988; 200, pp. 513-522. [DOI: https://dx.doi.org/10.1016/0022-2836(88)90540-2]
63. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci.; 1988; 28, pp. 31-36. [DOI: https://dx.doi.org/10.1021/ci00057a005]
64. Weininger, D.; Weininger, A.; Weininger, J.L. SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci.; 1989; 29, pp. 97-101. [DOI: https://dx.doi.org/10.1021/ci00062a008]
65. O’Boyle, N.M. Towards a Universal SMILES representation—A standard method to generate canonical SMILES based on the InChI. J. Cheminform.; 2012; 4, 22. [DOI: https://dx.doi.org/10.1186/1758-2946-4-22] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/22989151]
66. Landrum, G. Open-Source Cheminformatics. Available online: https://www.rdkit.org (accessed on 17 November 2024).
67. Mauri, A.; Consonni, V.; Pavan, M.; Todeschini, R. Dragon software: An easy approach to molecular descriptor calculations. MATCH Commun. Math. Comput. Chem.; 2006; 56, pp. 237-248.
68. Willighagen, E.L.; Mayfield, J.W.; Alvarsson, J.; Berg, A.; Carlsson, L.; Jeliazkova, N.; Kuhn, S.; Pluskal, T.; Rojas-Chertó, M.; Spjuth, O. The Chemistry Development Kit (CDK) v2.0: Atom typing, depiction, molecular formulas, and substructure searching. J. Cheminform.; 2017; 9, 33. [DOI: https://dx.doi.org/10.1186/s13321-017-0220-4]
69. Masand, V.H.; Rastija, V. PyDescriptor: A new PyMOL plugin for calculating thousands of easily understandable molecular descriptors. Chemometr. Intell. Lab. Syst.; 2017; 169, pp. 12-18. [DOI: https://dx.doi.org/10.1016/j.chemolab.2017.08.003]
70. Bjerrum, E.J. SMILES enumeration as data augmentation for neural network modeling of molecules. arXiv; 2017; [DOI: https://dx.doi.org/10.48550/arXiv.1703.07076] arXiv: 1703.07076
71. Li, X.; Fourches, D. Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT. J. Cheminform.; 2020; 12, 27. [DOI: https://dx.doi.org/10.1186/s13321-020-00430-x]
72. Kimber, T.B.; Engelke, S.; Tetko, I.V.; Bruno, E.; Godin, G. Synergy effect between convolutional neural networks and the multiplicity of SMILES for improvement of molecular prediction. arXiv; 2018; [DOI: https://dx.doi.org/10.48550/arXiv.1812.04439] arXiv: 1812.04439
73. Tetko, I.V.; Karpov, P.; Bruno, E.; Kimber, T.B.; Godin, G. Augmentation is what you need! Proceedings of the International Conference on Artificial Neural Networks; Munich, Germany, 17–19 September 2019; Springer: New York, NY, USA, 2019; pp. 831-835.
74. Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data; 2019; 6, 60. [DOI: https://dx.doi.org/10.1186/s40537-019-0197-0]
75. Berman, H.M.; Kleywegt, G.J.; Nakamura, H.; Markley, J.L. The Protein Data Bank at 40: Reflecting on the past to prepare for the future. Structure; 2012; 20, pp. 391-396. [DOI: https://dx.doi.org/10.1016/j.str.2012.01.010] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/22404998]
76. Dimitropoulos, D.; Ionides, J.; Henrick, K. Using MSDchem to search the PDB ligand dictionary. Curr. Protoc. Bioinform.; 2006; 15, pp. 14.3.1-14.3.21. [DOI: https://dx.doi.org/10.1002/0471250953.bi1403s15] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/18428761]
77. Zhang, S.; Krieger, J.M.; Zhang, Y.; Kaya, C.; Kaynak, B.; Mikulska-Ruminska, K.; Doruker, P.; Li, H.; Bahar, I. ProDy 2.0: Increased scale and scope after 10 years of protein dynamics modelling with Python. Bioinformatics; 2021; 37, pp. 3657-3659. [DOI: https://dx.doi.org/10.1093/bioinformatics/btab187] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33822884]
78. Fauchere, J.L.; Charton, M.; Kier, L.B.; Verloop, A.; Pliska, V. Amino acid side chain parameters for correlation studies in biology and pharmacology. Int. J. Pept. Protein Res.; 1988; 32, pp. 269-278. [DOI: https://dx.doi.org/10.1111/j.1399-3011.1988.tb01261.x] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/3209351]
79. Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, NY, USA, 2009; Volume 2, 745.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
The physicochemical properties of amino acid residues from the AAindex database are widely used as predictors in models of protein structure and properties. However, the AAindex database contains data only for the 20 canonical amino acids. Non-canonical amino acids, while less common, are not rare; the Protein Data Bank (PDB) includes proteins with more than 1000 distinct non-canonical amino acids. In this study, we propose a method to estimate the AAindex physicochemical properties of non-canonical amino acids and assess the prediction quality. We implemented our method as a bioinformatics tool and estimated the properties of the non-canonical amino acids found in the PDB, using SMILES encodings of their chemical composition obtained from the PDBeChem databank. The bioinformatics tool and the resulting database of estimated properties are freely available on the authors’ website and for download via GitHub.
Details


1 Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Vavilov Str., 32, 119991 Moscow, Russia
2 Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Vavilov Str., 32, 119991 Moscow, Russia