1. Introduction
We have been working since 2015 on the problem of testing the alignment of protein domain families proposed by expert biologists and bioinformaticians. We have found that selected entropy measures are well suited to testing the results published by those professionals and that they lend themselves to a rigorous ANOVA statistical analysis [1]. In order to reduce the search space of admissible values of these entropy measures, we have emphasized the need to work in the region of their parameter space associated with strict concavity. This study was undertaken in a previous work, and we present a summary of those developments in Section 2. In the present work, we aim to complement the results of a previous publication [2]: a further restriction of the parameter space has to be performed in order to guarantee the synergy of the probability distributions to be tested. Non-synergetic distributions are not worth working with, because they do not preserve the fundamental property that more information about amino acids is obtained from t-sets of columns than by summing up the information obtained from the individual columns. In Section 3, a brief digression introduces the Sharma–Mittal class of entropy measures. Section 4 emphasizes the synergy of the distributions and its consequences for the reduction of the parameter space of Sharma–Mittal entropies. In Section 5, we analyse the maximal extension of the parameter space and repeat the reduction process imposed by the requirement of fully synergetic distributions of Section 4. In Section 6, we study the relation between Hölder and generalized Khinchin–Shannon (GKS) inequalities, and Section 7 presents our concluding remarks.
2. The Construction of the Probabilistic Space
Let us consider a set of domains (rows) from a chosen family of protein domains. In order to associate a rectangular array with this family, to be taken as its representative in the probabilistic space we are constructing, we specify its number of columns. This means that we disregard all rows with fewer amino acids than the chosen number of columns and preserve the rows with at least that many amino acids, disregarding their surplus amino acids. We then choose m of the remaining rows to obtain rectangular arrays; there are as many such arrays as there are ways of choosing the m rows. Any one of them can be used as a representative of the domain family to be analysed in the statistical procedure to be implemented.
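Although no software accompanies the paper, a minimal Python sketch may help to make this construction concrete. The function below builds one representative rectangular array from a family of aligned sequences under the assumptions stated above; the function name, the sampling strategy, and the parameters are illustrative, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of the construction described above: keep only
# rows with at least n amino acids, truncate them to n columns, and sample m of them to form
# one representative m x n rectangular array. Function name, sampling and parameters are
# illustrative assumptions.
import random

def representative_array(sequences, n, m, seed=0):
    """Return one m x n array (a list of m strings of length n) sampled from the family."""
    eligible = [seq[:n] for seq in sequences if len(seq) >= n]   # discard short rows, trim long ones
    if len(eligible) < m:
        raise ValueError("fewer than m rows have at least n amino acids")
    random.seed(seed)
    return random.sample(eligible, m)   # one of the C(len(eligible), m) possible arrays
```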
The next step is to assign a joint probability of occurrence to each set of amino acids found in a chosen set of t columns, given by
$p_{j_1 j_2 \cdots j_t}(a_{j_1}, a_{j_2}, \ldots, a_{j_t}) = \dfrac{n_{j_1 j_2 \cdots j_t}(a_{j_1}, a_{j_2}, \ldots, a_{j_t})}{m}, \qquad (1)$
where $n_{j_1 j_2 \cdots j_t}(a_{j_1}, a_{j_2}, \ldots, a_{j_t})$ stands for the number of occurrences of the set of amino acids $(a_{j_1}, a_{j_2}, \ldots, a_{j_t})$ in the t columns $j_1, j_2, \ldots, j_t$ of the subarray of the representative array. The symbols $a_{j_1}, \ldots, a_{j_t}$ run over the letters of the one-letter code for the twenty amino acids: {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}. We then have
$\sum_{a_{j_1}} \sum_{a_{j_2}} \cdots \sum_{a_{j_t}} p_{j_1 j_2 \cdots j_t}(a_{j_1}, a_{j_2}, \ldots, a_{j_t}) = 1. \qquad (2)$
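The following sketch (again an illustration, not the authors' code) computes the empirical joint probabilities of Equation (1) for a chosen t-set of columns and checks the normalization of Equation (2); names and the 0-based indexing convention are assumptions.

```python
# A minimal sketch (not the authors' code) of Equations (1) and (2): empirical joint
# probabilities of the t-sets of amino acids found in the m rows of the representative
# array at a chosen set of t columns (0-based indices here).
from collections import Counter

def joint_probabilities(array, columns):
    """array: list of m equal-length strings; columns: the t column indices."""
    m = len(array)
    counts = Counter(tuple(row[j] for j in columns) for row in array)   # occurrence counts
    probs = {tset: c / m for tset, c in counts.items()}                 # Equation (1)
    assert abs(sum(probs.values()) - 1.0) < 1e-9                        # Equation (2)
    return probs
```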
We also introduce the conditional probabilities of occurrence, which are given implicitly by
$p_{j_1 \cdots j_t}(a_{j_1}, \ldots, a_{j_t}) = p_{j_1 \cdots j_{t-1} \mid j_t}(a_{j_1}, \ldots, a_{j_{t-1}} \mid a_{j_t})\, p_{j_t}(a_{j_t}), \qquad (3)$
where $p_{j_1 \cdots j_{t-1} \mid j_t}(a_{j_1}, \ldots, a_{j_{t-1}} \mid a_{j_t})$ is the probability of occurrence of the amino acids in the columns $j_1, \ldots, j_{t-1}$ if the distribution of amino acids in the $j_t$-th column is known a priori. Bayes' law for probabilities of occurrence [2,3] can then be written as
(4)
The equality of the first three right-side members, as well as the equality of the last three, corresponds to the application of Bayes' law [2,3]. The symmetries of the joint probability distribution are due to the ordering adopted for the columns of the distributions of amino acids. From this ordering of the columns, the values assumed by the column-index variables are respectively given by
(5)
We then have geometric objects of t columns, with one component for each of the $20^t$ possible sets of amino acids.
3. The Sharma–Mittal Class of Entropy Measures
As emphasized in Ref. [2], the introduction of functions of random variables, such as entropy measures associated with the probabilities of occurrence, is suitable for providing an analysis of the evolution of these probabilities through the regions of the parameter space of the entropies. The class of Sharma–Mittal entropy measures seems particularly well adapted to this task when related to the occurrence of amino acids in the geometric objects introduced above. The thermodynamic interpretation of the notion of entropy greatly helps to classify the distribution of its values associated with protein domain databases and to interpret its evolution through the Fokker–Planck equations to be treated in forthcoming articles in this line of research.
The two-parameter Sharma–Mittal class of entropy measures is usually given by
$(SM)_{r,s} = \dfrac{1}{1-r}\left[\left(\chi_s\right)^{\frac{1-r}{1-s}} - 1\right], \qquad (6)$
where
$\chi_s \equiv \sum_{a_{j_1}} \cdots \sum_{a_{j_t}} \left[p_{j_1 \cdots j_t}(a_{j_1}, \ldots, a_{j_t})\right]^{s}. \qquad (7)$
The parameters r, s must be restricted to a region of the parameter space corresponding to strict concavity. A necessary requirement to be satisfied [3] is
(8)
where $P_{j_1 \cdots j_t}(a_{j_1}, \ldots, a_{j_t})$ stands for the escort probability associated with the joint probability $p_{j_1 \cdots j_t}(a_{j_1}, \ldots, a_{j_t})$, or
$P_{j_1 \cdots j_t}(a_{j_1}, \ldots, a_{j_t}) = \dfrac{\left[p_{j_1 \cdots j_t}(a_{j_1}, \ldots, a_{j_t})\right]^{s}}{\chi_s}. \qquad (9)$
Equation (8) leads to
(10)
Some special cases of one-parameter entropies are commonplace in the scientific literature [3,4,5,6,7,8,9]. The $r = s$ region is the domain of the Havrda–Charvat [6] entropy measure,
$(HC)_s \equiv (SM)_{s,s} = \dfrac{1}{1-s}\left(\chi_s - 1\right). \qquad (11)$
The $r = 2 - s$ region will stand for the domain of the Landsberg–Vedral [7] entropy measure,
$(LV)_s \equiv (SM)_{2-s,\,s} = \dfrac{1}{s-1}\left(\dfrac{1}{\chi_s} - 1\right) = \dfrac{(HC)_s}{\chi_s}. \qquad (12)$
The Rényi [8] and the “non-extensive” Gaussian [9] entropy measures are obtained from limit processes:
$(R)_s \equiv \lim_{r \to 1} (SM)_{r,s} = \dfrac{\ln \chi_s}{1-s}, \qquad (G)_r \equiv \lim_{s \to 1} (SM)_{r,s}. \qquad (13)$
After using the definition of $\chi_s$, Equation (7), and from Equations (1) and (2), we get
$(G)_r = \dfrac{1}{1-r}\left[e^{(1-r)S} - 1\right], \qquad (14)$
where $S$ is the Gibbs–Shannon entropy measure
$S = -\sum_{a_{j_1}} \cdots \sum_{a_{j_t}} p_{j_1 \cdots j_t}(a_{j_1}, \ldots, a_{j_t}) \ln p_{j_1 \cdots j_t}(a_{j_1}, \ldots, a_{j_t}). \qquad (15)$
The Gibbs–Shannon entropy measure, Equation (15), is also obtained by taking the convenient limits of the special cases of Sharma–Mittal entropies, Equations (11)–(14):
$S = \lim_{s \to 1} (HC)_s = \lim_{s \to 1} (LV)_s = \lim_{s \to 1} (R)_s = \lim_{r \to 1} (G)_r. \qquad (16)$
We shall analyse in the next section the structure of the two-parameter space of Sharma–Mittal entropy by taking into consideration these special cases.
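As a hedged numerical illustration, the sketch below evaluates the two-parameter Sharma–Mittal entropy in its standard textbook form and recovers the special cases named above by taking the corresponding parameter values or limits. The placement of the parameters r and s is assumed to match the paper's notation; the toy distribution over t-sets is arbitrary.

```python
# A hedged numerical sketch (not the authors' code) of the two-parameter Sharma-Mittal
# entropy in its standard textbook form, together with the one-parameter special cases
# named above. The placement of r and s is assumed to match the paper's notation.
import numpy as np

def sharma_mittal(p, r, s):
    chi = np.sum(p ** s)                         # sum of the s-th powers of the probabilities
    return (chi ** ((1 - r) / (1 - s)) - 1) / (1 - r)

p = np.array([0.5, 0.25, 0.15, 0.10])
s, eps = 0.7, 1e-6
havrda_charvat   = sharma_mittal(p, s, s)        # r = s
landsberg_vedral = sharma_mittal(p, 2 - s, s)    # r = 2 - s
renyi    = sharma_mittal(p, 1 + eps, s)          # r -> 1 limit (numerical approximation)
gaussian = sharma_mittal(p, 0.5, 1 + eps)        # s -> 1 limit (numerical approximation)
gibbs_shannon = -np.sum(p * np.log(p))           # recovered when both r, s -> 1
print(havrda_charvat, landsberg_vedral, renyi, gaussian)
print(np.isclose(sharma_mittal(p, 1 + eps, 1 + 2 * eps), gibbs_shannon, atol=1e-3))   # True
```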
We now recall that, in the Gibbs–Shannon limit, a conditional entropy measure is defined [3] by
(17)
Analogously, we then have for the conditional Sharma–Mittal entropy measure [3]
(18)
A straightforward calculation shows that, analogously to Equation (16), we will have
(19)
From Equations (6), (7) and (18) and the application of Bayes' law, Equation (4), we can write
(20)
4. Aspects of Synergy and the Reduction of the Parameter Space for Fully Synergetic Distributions
For the Gibbs–Shannon entropy measure, the inequality written by A. Y. Khinchin [3,10] is
(21)
This inequality would be described by Khinchin as: “on the average, the a priori knowledge of the distribution on the $j_t$-th column can only decrease the uncertainty of the distribution on the columns $j_1, \ldots, j_{t-1}$”. We can write an analogous inequality for the Sharma–Mittal class of entropies:
(22)
From Equations (20) and (22), we then get
(23)
After iterating this inequality, we can also write
(24)
The inequalities in (21)–(24) are associated with what are called “synergetic conditions”. In this section, we also derive the fully synergetic conditions as GKS inequalities. After using Equations (7) and (9) in Equation (23), we get
(25)
and, after iteration and use of Equation (24),
(26)
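The synergy property is easiest to check numerically in the Gibbs–Shannon limit, where it reduces to the classical subadditivity of the Shannon entropy. The sketch below (not the authors' code) reuses the `joint_probabilities` helper introduced earlier; the toy array is arbitrary.

```python
# A minimal sketch (not the authors' code) of the synergy property in the Gibbs-Shannon
# limit: the joint entropy of a t-set of columns never exceeds the sum of the single-column
# entropies, so a t-set of columns carries more information than its columns taken
# separately. Reuses the joint_probabilities sketch given earlier.
import numpy as np

def shannon(probs):
    p = np.array(list(probs.values()))
    return float(-np.sum(p * np.log(p)))

def is_synergetic(array, columns):
    joint = shannon(joint_probabilities(array, columns))
    marginals = sum(shannon(joint_probabilities(array, [j])) for j in columns)
    return joint <= marginals + 1e-12            # iterated Khinchin-Shannon inequality

toy = ["ACDA", "ACDC", "AGDA", "ACEA", "TCDA"]   # 5 rows, 4 columns
print(is_synergetic(toy, [0, 1, 2]))             # always True in the Gibbs-Shannon limit
```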
The hatched region of strict concavity in the parameter space of Sharma–Mittal entropies is depicted in Figure 1. The special cases corresponding to the Havrda–Charvat ($r = s$), Landsberg–Vedral ($r = 2 - s$), Rényi ($r = 1$), and “non-extensive” Gaussian ($s = 1$) entropies are also represented.
We can identify three subregions in Figure 1. They will correspond to
(27)
(28)
(29)
where the ordering of the entropy terms has been obtained from Equation (26). The subregions R1 and R3 are what we call fully synergetic subregions, and the corresponding inequalities are the GKS inequalities [2]. The subregions R1, R2, and R3 are depicted in Figure 2a–c, respectively. The union of the subregions R1 and R3 is the fully synergetic Khinchin–Shannon restriction to be imposed on the strict concavity region of Figure 1; it is depicted in Figure 2d below.
5. The Maximal Extension of the Parameter Space and Its Reduction for Fully Synergetic Distributions
In Figure 1 and Figure 2d, we have depicted the structure of the strict concavity region for Sharma–Mittal entropy measures and its reduction to a subregion by the application of the requirement of fully synergetic distributions, respectively. Our analysis has used a coarse-grained approach to concavity, given by Equations (8) and (10). We now introduce some necessary refinements for characterizing the probabilities of occurrence in subarrays of m rows and t columns. For t columns, there are $20^t$ possible sets of amino acids, which can be a very large number; however, instead of counting over all of these possibilities, we may count the groups of equal t-sets of amino acids which actually appear in the m rows of the array. We characterize these groups by their number, from a single group (all t-sets in the m rows equal) up to m groups (all t-sets different), and by the number of equal t-sets belonging to each group.
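A small sketch of this grouping, using Python's `Counter` and assuming the array representation introduced earlier, is given below; it is illustrative only.

```python
# A minimal sketch (not the authors' code) of the grouping described above: among the m
# rows of the array, equal t-sets of amino acids are collected into groups, each
# characterized by its multiplicity (the number of equal t-sets it contains).
from collections import Counter

def tset_groups(array, columns):
    """Return {t-set: multiplicity} for the t-sets found in the m rows at the given columns."""
    return Counter(tuple(row[j] for j in columns) for row in array)

toy = ["ACDA", "ACDC", "AGDA", "ACEA", "ACDA"]
groups = tset_groups(toy, [0, 1, 2])
print(len(groups), sum(groups.values()))         # number of distinct groups, and m = 5
```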
In Equation (2), the sum runs over all the sets of amino acids that make up the geometric object of probabilities of occurrence defined in Equation (1). We can now perform the sum over the groups of equal t-sets instead and write
(30)
where the sum now runs over the groups of equal t-sets. We also have, from Equation (7),
(31)
From Equations (30) and (31), we can now proceed to the calculation of the Hessian matrix of the Sharma–Mittal entropy measures. For the first derivatives with respect to the probabilities of occurrence, we have
(32)
We then have, for a generic element of the Hessian matrix [2],
(33)
where the escort probability associated with the corresponding probability of occurrence appears, given by
(34)
The principal minors of the Hessian matrix are given by
(35)
and we have
(36)
according to Equation (31). From Equations (35) and (36), the requirement of strict concavity will lead to
(37)
We then have
(38)
This corresponds to the criterion of negative definiteness of the Hessian matrix for strict concavity of multivariate functions [11]. Each k-value is associated with a k-epigraph region, which is the k-extension of the strict concavity region presented in Figure 1. These regions are given by
(39)
The greatest lower bound of the sequence of k-curves is given by the limiting curve below. We then have
(40)
We can then write, for the maximal extended region of strict concavity,
(41)
The region corresponding to Equation (41) is depicted in Figure 3 below. We are now ready to undertake the application of the restrictions for fully synergetic distributions (validity of GKS inequalities) to the maximal strict concavity region of Figure 3.
We start by identifying two regions included in Figure 3. They will be given by
(42)
(43)
These regions are depicted in Figure 4a,b, respectively. In order to find the reduced region corresponding to Figure 3, analogously to what has been done for Figure 1, we also need the subregions R1 and R3, Equations (27) and (29): the resulting subregion of fully synergetic distributions is depicted in Figure 5.
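As a closing illustration for this section, the negative-definiteness criterion invoked in Equations (33)–(38) can be probed numerically: the sketch below builds a finite-difference Hessian of the Sharma–Mittal entropy with respect to the probabilities and checks that its eigenvalues are negative at a chosen point of the parameter space. It reuses the `sharma_mittal` helper sketched earlier; the point (r, s) and the toy distribution are assumptions.

```python
# A hedged numerical sketch (not the authors' derivation) of the negative-definiteness
# criterion: build a central finite-difference Hessian of the Sharma-Mittal entropy with
# respect to the probabilities (treated here as unconstrained positive variables) and
# check that all of its eigenvalues are negative. Reuses the sharma_mittal sketch above.
import numpy as np

def hessian(f, p, h=1e-5):
    n = len(p)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            pp = p.copy(); pp[i] += h; pp[j] += h
            pm = p.copy(); pm[i] += h; pm[j] -= h
            mp = p.copy(); mp[i] -= h; mp[j] += h
            mm = p.copy(); mm[i] -= h; mm[j] -= h
            H[i, j] = (f(pp) - f(pm) - f(mp) + f(mm)) / (4 * h * h)
    return H

p = np.array([0.5, 0.3, 0.2])
r, s = 1.3, 0.8                                  # one point of the parameter space to be tested
H = hessian(lambda q: sharma_mittal(q, r, s), p)
print(np.all(np.linalg.eigvalsh(H) < 0))         # True when the entropy is strictly concave there
```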
6. Hölder Inequalities and GKS Inequalities: A Possible Conjecture
In this section, we study the relation between GKS inequalities [2] and Hölder inequalities by using examples of distributions obtained from databases of protein domain families. To begin, some definitions and properties of the probabilistic space are in order.
Let us first introduce the definition of the conditional probability of occurrence associated with the escort probability of occurrence [12]. This is a simple application of Equation (3) to escort probabilities:
(44)
From the definition of escort probabilities, Equation (9), we can write
(45)
and
(46)
In Equations (44)–(46), the symbols assume the representative letters of the one-letter code for the 20 amino acids, {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}. After substituting Equations (45) and (46) into Equation (44), we get
(47)
and, from Equation (46),
(48)
We also write the definition of the escort probability of occurrence associated with the conditional probability of occurrence [12]:
(49)
We can check the definitions of Equations (48) and (49) from the equality of the two escort probabilities with the original conditional probability, for
(50)
We should note that the denominators of the right-hand sides of Equations (48) and (49), namely
(51)
and
(52)
will be equal if all the amino acids in the column are equal. If we have, for instance, the column given by
(53)
then the corresponding unit vectors of probabilities will also be equal and given by
(54)
This means that, for this special case of an event of rare occurrence, we also have the equality of the conditional of the escort probability and the escort of the conditional probability, i.e., of the left-hand sides of Equations (48) and (49), respectively. For a column with a generic distribution of amino acids, the denominators on the right-hand sides of Equations (48) and (49), given by Equations (51) and (52), will no longer be equal. An ordering of these denominators should be decided from the probabilities of amino acid occurrence in a chosen protein domain family.
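One plausible reading of the two constructions compared in this section can be sketched numerically as follows: the "conditional of the escort" divides the escort of the joint distribution by the escort of the single-column distribution, while the "escort of the conditional" renormalizes the s-th powers of the conditional probabilities. Both reduce to the ordinary conditional probability at s = 1, in the sense of Equation (50), and, as claimed above, they also coincide when the conditioning column contains a single amino acid. The definitions, names, and toy numbers below are assumptions, not the authors' formulas.

```python
# A sketch (not the authors' formulas) of one plausible reading of the two constructions:
# "conditional of the escort" vs. "escort of the conditional". Both reduce to p(a|b) at s = 1.
import numpy as np

def escort(p, s):
    q = p ** s
    return q / q.sum()

def conditional_of_escort(joint, s):
    """joint[a, b] = p(a, b); returns escort(joint) divided column-wise by escort of p(b)."""
    return escort(joint, s) / escort(joint.sum(axis=0), s)[np.newaxis, :]

def escort_of_conditional(joint, s):
    cond = joint / joint.sum(axis=0, keepdims=True)      # p(a|b)
    q = cond ** s
    return q / q.sum(axis=0, keepdims=True)

joint = np.array([[0.20, 0.10], [0.30, 0.05], [0.25, 0.10]])   # toy p(a, b)
print(np.allclose(conditional_of_escort(joint, 1.0), escort_of_conditional(joint, 1.0)))   # True
print(np.allclose(conditional_of_escort(joint, 0.8), escort_of_conditional(joint, 0.8)))   # False in general
```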
This study is undertaken with the help of the function Z of Equation (51), its analogue of Equation (52), and the functions J and U defined below:
(55)
(56)
Our method is then the comparison of pairs of these functions in order to search for the effect of fully synergetic distributions of amino acids.
There are six comparisons to study:
(I).
(57)
(II).
(58)
(III).
(59)
(IV).
(60)
where the quantity appearing above is defined by
(61)
(V).
(62)
(VI).
(63)
Equations (57)–(59) should each be multiplied by the corresponding probability of occurrence and then summed over the amino acids. We then have, respectively,
(64)
(65)
(66)
Equations (60), (62) and (63) can be written, respectively, as
(67)
(68)
(69)
Hölder's inequality, as applied to probabilities of occurrence [3], is written as
(70)
After multiplying by the corresponding probability of occurrence and summing over the amino acids, we get
(71)
We also define
(72)
(73)
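Since the orderings of Equations (65), (68), (70) and (71) rest on Hölder's inequality, a quick numerical sanity check of its textbook form may be useful. The sketch below verifies the generic inequality for two toy distributions over the 20 amino acids; the exponents actually used in the paper's Equations (70) and (71), which are tied to the entropic parameter s, are not reproduced here.

```python
# A minimal numerical sanity check (not the authors' code) of the textbook Hölder inequality
# for two probability vectors p, q and conjugate exponents a, b with 1/a + 1/b = 1:
#     sum_j p_j * q_j  <=  (sum_j p_j**a)**(1/a) * (sum_j q_j**b)**(1/b).
import numpy as np

def holder_holds(p, q, a):
    b = a / (a - 1.0)                            # conjugate exponent, a > 1
    lhs = np.sum(p * q)
    rhs = np.sum(p ** a) ** (1 / a) * np.sum(q ** b) ** (1 / b)
    return lhs <= rhs + 1e-12

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(20))                   # two toy distributions over the 20 amino acids
q = rng.dirichlet(np.ones(20))
print(all(holder_holds(p, q, a) for a in (1.5, 2.0, 3.0)))   # True
```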
We then summarize the results obtained:
Equation (64) is only an identity.
Equations (65) and (68) can be ordered by Hölder’s inequality, Equations (70) and (71).
Equations (66) and (69) can be ordered by GKS inequalities, corresponding to fully synergetic distributions of amino acids.
Equation (67) cannot be ordered without additional experimental/phenomenological information on the probabilities of occurrence to be obtained from updated versions of protein domain family databases [13].
We now collect the formulae obtained from the analysis performed in this section. Equations (65) and (68) are ordered by Hölder's inequality. We write
(74)
Equations (66) and (69) are ordered by the GKS inequality. We write
(75)
After using Equation (73), we can write Equation (67) as
(76)
In Figure 6a,b, we have depicted the curves corresponding to the Hölder-ordered and Khinchin–Shannon-ordered functions for seven 3-sets of contiguous columns and 80 rows, chosen from the databases Pfam 27.0 and Pfam 35.0, respectively. There are also inset figures to show details of the curves.
In Figure 7a,b, we do the same for the corresponding differences. We emphasize that, for the 3-sets in which the ordering of Equation (76) is also verified, the GKS inequalities will result from the validity of Hölder's inequality. We have worked with the PF01926 protein domain family to perform all the calculations.
7. Concluding Remarks
Our first comment on the present work concerns the possibility of working in a region of the parameter space that preserves both the strict concavity and the fully synergetic structure of the Sharma–Mittal class of entropy measure distributions to be visited by the solutions of a new statistical mechanics approach. The usual work with Havrda–Charvat distributions describes the evolution along the boundary of the region that was correctly considered to correspond to strict concavity, but this boundary is also known to be non-synergetic on part of its range. We now have the opportunity to develop this statistical mechanics approach along an extended boundary, preserving strict concavity and allowing the study of the evolution of fully synergetic entropy distributions. A first sketch of these developments will be presented in a forthcoming publication.
With respect to Figures 6 and 7, we could hypothesize that, whenever the ordering of the relevant functions cannot be obtained, this is due to the poor alignment of some of the protein domain families we have been using; however, we are not yet confident enough to do so, because much more "in silico" information, obtained from many other protein domain families, would be needed. In other words, we expect that a good alignment of a protein domain family will result in this ordering, but we need to verify it in a large number of families from different Pfam versions before proposing a method to improve the Pfam database. This looks promising for good scientific work in the line of research we have aimed to introduce in Ref. [2] and in this contribution.
Author Contributions: Conceptualization, R.P.M. and S.C.d.A.N.; methodology, R.P.M. and S.C.d.A.N.; formal analysis, R.P.M. and S.C.d.A.N.; writing—original draft preparation, R.P.M.; writing—review and editing, R.P.M. and S.C.d.A.N.; visualization, R.P.M. and S.C.d.A.N.; supervision, R.P.M.; project administration, R.P.M. All authors have read and agreed to the published version of the manuscript.
Conflicts of Interest: The authors declare no conflict of interest.
The following abbreviation is used in this manuscript:
GKS: Generalized Khinchin–Shannon
Figure 1. The strict concavity region of the Sharma–Mittal class of entropy measures. It is the epigraph of the Havrda–Charvat curve (r = s), which is depicted in brown. The Landsberg–Vedral (r = 2 − s), Rényi (r = 1), and “non-extensive” Gaussian (s = 1) curves are depicted in green, blue, and red, respectively.
Figure 2. Subregions of the strict concavity region of the Sharma–Mittal class of entropy measures. (a) Khinchin–Shannon subregion R1 (fully synergetic); (b) the non-synergetic subregion R2; (c) Khinchin–Shannon subregion R3 (fully synergetic); (d) the union of the fully synergetic Khinchin–Shannon subregions, i.e., the reduction of the region of Figure 1 obtained by taking into consideration fully synergetic distributions only.
Figure 3. The maximal strict concavity region of the Sharma–Mittal class of entropy measures. The hatched region is the epigraph of the limiting curve, which is depicted in black. The Havrda–Charvat curve (r = s) is in brown. The Landsberg–Vedral (r = 2 − s), Rényi (r = 1), and “non-extensive” Gaussian (s = 1) curves are depicted in green, blue, and red, respectively.
Figure 4. Subregions of the maximal strict concavity region of the Sharma–Mittal class of entropy measures (Figure 3). (a) The fully synergetic Khinchin–Shannon subregion; (b) the non-synergetic subregion.
Figure 5. The reduction of the region of Figure 3 obtained by taking into consideration the fully synergetic distributions only.
Figure 6. Hölder-ordered distributions (dashed curves) and Khinchin–Shannon-ordered distributions (continuous curves) of the PF01926 protein domain family obtained from (a) Pfam 27.0 and (b) Pfam 35.0. The top-right inset shows details of the curves.
Figure 7. Differences between the Hölder-ordered and Khinchin–Shannon-ordered curves of the PF01926 protein domain family obtained from (a) Pfam 27.0 and (b) Pfam 35.0. The top-right inset shows details of the curves. When the two orderings hold together, the GKS inequalities follow from Hölder's inequality.
References
1. Mondaini, R.P.; de Albuquerque Neto, S.C. The Statistical Analysis of Protein Domain Family Distributions via Jaccard Entropy Measures. In Trends in Biomathematics: Modeling Cells, Flows, Epidemics, and the Environment; Mondaini, R.P., Ed.; Springer: Cham, Switzerland, 2020; pp. 169-207.
2. Mondaini, R.P.; de Albuquerque Neto, S.C. Alternative Entropy Measures and Generalized Khinchin-Shannon Inequalities. Entropy; 2021; 23, 1618. [DOI: https://dx.doi.org/10.3390/e23121618] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34945924]
3. Mondaini, R.P.; de Albuquerque Neto, S.C. Khinchin–Shannon Generalized Inequalities for “Non-additive” Entropy Measures. In Trends in Biomathematics: Mathematical Modeling for Health, Harvesting, and Population Dynamics; Mondaini, R.P., Ed.; Springer: Cham, Switzerland, 2019; pp. 177-190.
4. Beck, C. Generalized Information and Entropy Measures in Physics. Contemp. Phys.; 2009; 50, pp. 495-510. [DOI: https://dx.doi.org/10.1080/00107510902823517]
5. Sharma, B.D.; Mittal, D.P. New Non-additive Measures of Entropy for Discrete Probability Distributions. J. Math. Sci.; 1975; 10, pp. 28-40.
6. Havrda, J.; Charvát, F. Quantification Method of Classification Processes. Concept of Structural α-entropy. Kybernetika; 1967; 3, pp. 30-35.
7. Landsberg, P.T.; Vedral, V. Distributions and Channel Capacities in Generalized Statistical Mechanics. Phys. Lett. A; 1998; 247, pp. 211-217. [DOI: https://dx.doi.org/10.1016/S0375-9601(98)00500-3]
8. Rényi, A. On Measures of Entropy and Information. In Contributions to the Theory of Statistics, Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; Neyman, J., Ed.; University of California Press: Berkeley, CA, USA, 1961; Volume 1, pp. 547-561.
9. Oikonomou, T. Properties of the “Non-extensive Gaussian” Entropy. Phys. A; 2007; 381, pp. 155-163. [DOI: https://dx.doi.org/10.1016/j.physa.2007.03.010]
10. Khinchin, A.Y. Mathematical Foundations of Information Theory; Dover Publications, Inc.: New York, NY, USA, 1957.
11. Marsden, J.E.; Tromba, A. Vector Calculus, 6th ed.; W. H. Freeman and Company Publishers: New York, NY, USA, 2012.
12. Mondaini, R.P.; de Albuquerque Neto, S.C. The Maximal Extension of the Strict Concavity Region on the Parameter Space for Sharma-Mittal Entropy Measures. In Trends in Biomathematics: Stability and Oscillations in Environmental Social and Biological Models; Mondaini, R.P., Ed.; Springer: Cham, Switzerland, 2022.
13. Mistry, J.; Chuguransky, S.; Williams, L.; Qureshi, M.; Salazar, G.A.; Sonnhammer, E.L.L.; Tosatto, S.C.E.; Paladin, L.; Raj, S.; Richardson, L.J. et al. Pfam: The Protein Families Database in 2021. Nucleic Acids Res.; 2021; 49, pp. D412-D419. [DOI: https://dx.doi.org/10.1093/nar/gkaa913]
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
In this contribution, we specify the conditions for assuring the validity of the synergy of the distributions of probabilities of occurrence. We also study the subsequent restriction on the maximal extension of the strict concavity region on the parameter space of Sharma–Mittal entropy measures, which was derived in a previous paper in this journal. The present paper is then a necessary complement to that publication. The techniques introduced here are applied to protein domain families (Pfam databases, versions 27.0 and 35.0). The results show evidence of their usefulness for testing the classification work performed with the alignment methods used by expert biologists.