The demands for machine learning and knowledge extraction methods have been booming due to the unprecedented surge in data volume and data quality. Nevertheless, challenges arise amid the emerging data complexity, as significant portions of information and knowledge lie within the non-ordinal realm of data. To address these challenges, researchers have developed a considerable number of machine learning and knowledge extraction methods for various domain-specific problems. To characterize and extract information from non-ordinal data, these methods all point to the subject of information theory, established following Shannon’s landmark paper in 1948. This article reviews recent developments in entropic statistics, including the estimation of Shannon’s entropy and its functionals (such as mutual information and Kullback–Leibler divergence), the concepts of the entropic basis and generalized Shannon’s entropy (and its functionals), and their estimation and potential applications in machine learning and knowledge extraction. With knowledge of these recent developments, researchers can customize existing machine learning and knowledge extraction methods for better performance or develop new approaches to address emerging domain-specific challenges.
1. Introduction to Entropic Statistics
Entropic statistics is a collection of statistical procedures that characterize information from non-ordinal spaces with Shannon’s entropy and its generalized functionals. Such procedures include, but are not limited to, statistical methods involving Shannon’s entropy (entropy) and Mutual Information (MI) [1], Kullback–Leibler divergence (KL) [2], the entropic basis and diversity indices [3,4], and Generalized Shannon’s Entropy (GSE) and Generalized Mutual Information (GMI) [5]. The field of entropic statistics lies at the intersection of information theory and statistics. Entropic statistics quantities are also referred to as information-theoretic quantities [6,7].
There are two general data types—ordinal and non-ordinal (nominal). Ordinal data are data with an inherent numerical scale. For example, {52 F, 50 F, 49 F, 53 F}—a set of daily high temperatures at Nuuk, Greenland—is ordinal. Ordinal data are generated from random variables (which map outcomes from the sample space to the real numbers). For ordinal data, classical concepts, such as moments (mean, variance, covariance, etc.) and characteristic functions, are powerful tools for inducing various statistical methods, including but not limited to regression analysis [8] and analysis of variance (ANOVA) [9].
Non-ordinal data are data without an inherent numerical scale. For example, {androgen receptor, clock circadian regulator, epidermal growth factor, Werner syndrome RecQ helicase-like}—a subset of human gene names—is a set of data without an inherent numerical scale. Non-ordinal data are generated from random elements (which map outcomes from the sample space to an alphabet). Due to the absence of an inherent numerical scale, the concept of a random variable does not apply, and statistical concepts requiring an ordinal scale (e.g., mean, variance, covariance, and characteristic functions) no longer exist. For example, consider the aforementioned data of human gene names: what is the mean or variance of the data? Such questions cannot be answered because the concepts of mean and variance do not exist. In practice, however, researchers need to measure the level of dependence in a non-ordinal joint space between gene types and genetic phenotypes to study the genes’ functionalities. With ordinal data, one would use covariance and the methods built upon it, but the concept of covariance does not exist in such a non-ordinal space. Furthermore, well-established statistical methods that require an ordinal scale (e.g., regression and ANOVA) can no longer be directly applied.
Non-ordinal data have several variant names, such as categorical data, qualitative data, and nominal data. A common situation is a dataset that mixes ordinal and non-ordinal data. For such a dataset, a common practice is to introduce coded (dummy) variables [10]. However, introducing dummy variables is equivalent to separating the mixed dataset according to the classes of the non-ordinal variables to induce multiple purely ordinal subsets and then applying ordinal methods (such as regression analysis) case by case on the induced subsets. Unfortunately, this approach can be impractical because of the curse of dimensionality, particularly when there are many categorical variables or when some categorical variable has many categories (classes).
Given the challenges from non-ordinal data, entropic statistics methods focus on the underlying probability distribution instead of the associated labels. As a result, all entropic statistical quantities are location (permutation) invariant. The main strengths of entropic statistics lie within non-ordinal alphabets, or within mixed data spaces in which a significant portion of the information lies in the non-ordinal sub-space. For ordinal spaces, although ordinal variables can be binned into categorical variables, the strengths of entropic statistics generally cannot compensate for the loss of ordinal information during discretization. Therefore, ordinal statistical methods are preferred whenever they meet the analysis needs. In summary, potential scenarios for entropic statistics are:
The data lie within non-ordinal space.
The data are a mixture of ordinal and non-ordinal spaces, and the non-ordinal space is expected to carry a non-negligible portion of the information.
The data lie within ordinal space, yet the performance of ordinal statistics methods fails to meet the expectation.
The following notations are used throughout the article. They are listed here for convenience.
Let $\mathcal{X}$ and $\mathcal{Y}$ be two countable alphabets with cardinalities $K_1$ and $K_2$, respectively.
Let the Cartesian product be $\mathcal{X} \times \mathcal{Y}$ with a joint probability distribution $\mathbf{p}_{X,Y} = \{p_{i,j}\}$.
Let the two marginal distributions be respectively denoted by $\mathbf{p}_X = \{p_{i,\cdot}\}$ and $\mathbf{p}_Y = \{p_{\cdot,j}\}$, where $p_{i,\cdot} = \sum_{j} p_{i,j}$ and $p_{\cdot,j} = \sum_{i} p_{i,j}$; hence X is a variable on $\mathcal{X}$ with distribution $\mathbf{p}_X$ and Y is a variable on $\mathcal{Y}$ with distribution $\mathbf{p}_Y$.
For uni-variate situations, K stands for $K_1$ and $\mathbf{p} = \{p_k\}$ stands for $\mathbf{p}_X$.
Let $\{X_1, \dots, X_n\}$ be an independent and identically distributed (i.i.d.) random sample of size n from $\mathcal{X}$. Let $n_k$ be the number of observations equal to the letter $\ell_k$; hence $n_k$ is the count of occurrences of letter $\ell_k$ in the sample. Let $\hat{p}_k = n_k/n$. $\hat{\mathbf{p}} = \{\hat{p}_k\}$ is called the plug-in estimator of $\mathbf{p}$. Similarly, one can construct the plug-in estimators for $\mathbf{p}_X$ and $\mathbf{p}_Y$ and name them $\hat{\mathbf{p}}_X$ and $\hat{\mathbf{p}}_Y$, respectively.
For any two positive functions f and g of the sample size n, the notation $f = O(g)$ means $\limsup_{n \to \infty} f(n)/g(n) < \infty$.
For any two positive functions f and g of the sample size n, the notation $f = o(g)$ means $\lim_{n \to \infty} f(n)/g(n) = 0$.
Many concepts discussed in the following sections have continuous counterparts under the same concept names. The results reviewed in this article focus on non-ordinal data spaces; therefore, some notable results on ordinal spaces are not reviewed (for example, [11,12,13]). In Section 2, the estimation of some classic entropic statistics quantities is discussed. Section 3 reviews estimation results and properties for some recently developed information-theoretic quantities. Entropic statistics’ application potentials in machine learning (ML) and knowledge extraction are discussed in Section 4. Finally, some remarks are given in Section 5.
2. Classic Entropic Statistics Quantities and Estimation
This section reviews three classic entropic concepts and their estimation: Shannon’s entropy (Section 2.1.1), mutual information (Section 2.1.2), and Kullback–Leibler divergence (Section 2.2). These three concepts are among the earliest entropic concepts and have been intensively studied over the past decades. A large number of statistical methods and computational algorithms have been designed based on these three concepts [14,15,16]. Nevertheless, most of those methods and algorithms use naive plug-in estimation, which could be improved for a smaller estimation bias and better performance. For this reason, this section reviews several notable estimation methods as a reference. Some asymptotic properties are also presented; they provide theoretical guarantees for the corresponding estimators and enable statistical procedures such as hypothesis testing and confidence intervals.
2.1. Shannon’s Entropy and Mutual Information
2.1.1. Shannon’s Entropy
Established by Shannon in his landmark paper [1], the concept of entropy is the first and still the most important building block in characterizing information from non-ordinal spaces. Many of the established information-theoretic quantities are linear functions of entropy. Shannon’s entropy, H, is defined as
$$ H = -\sum_{k \ge 1} p_k \ln p_k. $$
Some remarkable properties of entropy are:
Property 1 (Entropy).
1. H is a measurement of dispersion. It is always non-negative by definition.
2. H = 0 if and only if the probability of some letter $\ell$ in $\mathcal{X}$ is 1; hence there is no dispersion.
3. For a finite alphabet with cardinality K, H is bounded from above by $\ln K$, and the maximum is achieved when the distribution is uniform ($p_k = 1/K$ for every k); hence maximum dispersion.
4. For a countably infinite alphabet, H may not exist (see Example 4 in Section 3).
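As a quick numerical illustration of the definition and of the bound in the third property, the following base-R sketch (with a hypothetical probability vector) computes H in nats and compares it with ln K.

```r
# Shannon's entropy (in nats) of a probability vector; 0 * ln(0) is treated as 0
shannon_entropy <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }

p <- c(0.5, 0.25, 0.15, 0.10)     # hypothetical distribution on K = 4 letters
shannon_entropy(p)                # strictly below the bound ln K
log(4)                            # the bound ln K
shannon_entropy(rep(1/4, 4))      # the uniform distribution attains ln K
```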
Entropy Estimation-The Plug-in Estimator
Estimation of entropy has been a core research topic for decades. Due to the curse of high dimensionality and the discrete, non-ordinal nature of the data, entropy estimation is a technically difficult problem, and advances in this area have been slow to come. The plug-in estimator of entropy (also known as the empirical entropy estimator), $\hat{H}$, defined as
$$ \hat{H} = -\sum_{k \ge 1} \hat{p}_k \ln \hat{p}_k, $$
is inarguably the most naive entropy estimator. $\hat{H}$ has been studied thoroughly in recent decades. Ref. [17] provided the asymptotic properties for $\hat{H}$ when K is finite. Namely,

Theorem 1 (Asymptotic property of $\hat{H}$ when K is finite).
$$ \sqrt{n}\,\big(\hat{H} - H\big) \stackrel{d}{\longrightarrow} N\big(0, \sigma^2\big), $$
where $\sigma^2 = \sum_{k=1}^{K} p_k (\ln p_k)^2 - H^2$. Ref. [18] derived the bias of $\hat{H}$ for finite K:
$$ E\big(\hat{H}\big) - H = -\frac{K-1}{2n} + O\left(\frac{1}{n^2}\right). \quad (1) $$
Ref. [19] derived the asymptotic properties for $\hat{H}$ when K is countably infinite. Namely,
Theorem 2 (Asymptotic property of $\hat{H}$ when K is countably infinite). For any nonuniform distribution satisfying the required moment condition, if there exists an integer-valued function $K(n)$ such that, as $n \to \infty$,
- 1.
,
- 2.
, and
- 3.
;
As discussed in [19], whether the conditions hold depends on the tail behavior of the distribution: they hold for distributions with sufficiently fast decaying tails and do not hold for distributions with heavier tails.
Entropy Estimation-The Miller–Madow and Jackknife Estimators
The Miller–Madow estimator [20] and the jackknife estimator [21] are two notable entropy estimators with bias adjustments. Namely,
$$ \hat{H}_{MM} = \hat{H} + \frac{\hat{K} - 1}{2n}, \quad (2) $$
where $\hat{K}$ is the observed sample cardinality (the number of distinct letters in the sample). For finite K, the bias of $\hat{H}_{MM}$ is of order $O(n^{-2})$. The jackknife estimator, $\hat{H}_{JK}$, is calculated in three steps:
for each $i = 1, \dots, n$, construct $\hat{H}_{-i}$, the plug-in estimator based on the sub-sample of size $n-1$ obtained by leaving the i-th observation out;
obtain $\hat{H}_{-i}$ for every $i = 1, \dots, n$; and then
compute the jackknife estimator
$$ \hat{H}_{JK} = n\hat{H} - \frac{n-1}{n}\sum_{i=1}^{n} \hat{H}_{-i}. \quad (3) $$
Equivalently, (3) can be written as
$$ \hat{H}_{JK} = \hat{H} + (n-1)\left(\hat{H} - \frac{1}{n}\sum_{i=1}^{n} \hat{H}_{-i}\right). $$
When K is finite, it can be shown that the bias of $\hat{H}_{JK}$ is also of order $O(n^{-2})$.
Asymptotic properties for $\hat{H}_{MM}$ and $\hat{H}_{JK}$ were derived in [22]. Both $\hat{H}_{MM}$ and $\hat{H}_{JK}$ reduce the bias to a higher-order power-decaying rate. Ref. [23] proved that the convergence of entropy estimators could be arbitrarily slow. Ref. [24] proved that, for finite K, an unbiased estimator of entropy does not exist. As a result, it is only possible to reduce the bias to a smaller extent.
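For concreteness, here is a minimal base-R sketch of the plug-in estimator, the Miller–Madow estimator in (2), and the jackknife estimator in (3); the categorical sample x is hypothetical.

```r
# Plug-in (empirical) entropy estimator from a vector of category counts
H_plugin <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Miller-Madow estimator, Equation (2): plug-in plus (K_hat - 1) / (2n),
# where K_hat is the observed number of categories
H_mm <- function(x) {
  counts <- table(x)
  H_plugin(counts) + (length(counts) - 1) / (2 * length(x))
}

# Jackknife estimator, Equation (3): n * H_hat - ((n - 1)/n) * sum of leave-one-out plug-ins
H_jk <- function(x) {
  n <- length(x)
  H_full <- H_plugin(table(x))
  H_loo <- vapply(seq_len(n), function(i) H_plugin(table(x[-i])), numeric(1))
  n * H_full - (n - 1) / n * sum(H_loo)
}

set.seed(1)
x <- sample(letters[1:6], size = 200, replace = TRUE)  # hypothetical categorical sample
c(plugin = H_plugin(table(x)), miller_madow = H_mm(x), jackknife = H_jk(x))
```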
Entropy Estimation-The Z-Estimator
Recent studies on entropy estimation have reduced the bias to an exponentially decaying rate. For example, the estimator $\hat{H}_z$ provided in [25] has an exponentially decaying bias (interested readers may refer to [26] for discussion of an entropy estimator that is algebraically equivalent to $\hat{H}_z$). Ref. [27] derived the asymptotic properties for $\hat{H}_z$. Namely,

Theorem 3 (Asymptotic property of $\hat{H}_z$ when K is finite).
$$ \sqrt{n}\,\big(\hat{H}_z - H\big) \stackrel{d}{\longrightarrow} N\big(0, \sigma^2\big), $$
where $\sigma^2 = \sum_{k=1}^{K} p_k (\ln p_k)^2 - H^2$. The following asymptotic properties for $\hat{H}_z$ when K is countably infinite were provided in [28].
Theorem 4 (Asymptotic property of $\hat{H}_z$ when K is countably infinite). For a nonuniform distribution satisfying the required moment condition, if there exists an integer-valued function $K(n)$ such that, as $n \to \infty$,
- 1.
,
- 2.
, and
- 3.
;
The sufficient condition given in Theorem 4 for the normality of $\hat{H}_z$ is slightly more restrictive than that of the plug-in estimator as stated in Theorem 2, and consequently supports a smaller class of distributions: there are distributions that satisfy the sufficient conditions of Theorem 2 but not those of Theorem 4. However, it is discussed in [28] that simulation results indicate that the asymptotic normality of $\hat{H}_z$ in Theorem 4 may still hold for such distributions even though they are not covered by the sufficient condition.
Remarks
Another perspective on entropy estimation is to combine $\hat{H}_z$ and the jackknife. Namely, one could use $\hat{H}_z$ in place of each plug-in estimator in (3). Interested readers may refer to [25], where a single-layer combination of $\hat{H}_z$ and the jackknife was discussed. In addition, ref. [29] presented a non-parametric entropy estimator ($\hat{H}_{CS}$) for the case where there are unseen species in the sample. According to the simulation study, $\hat{H}_{CS}$ has a smaller sample root mean squared error and a smaller bias than some of the competing estimators. Unfortunately, the bias decaying rate for $\hat{H}_{CS}$ was not theoretically offered; based on the simulation study, the rate appears to be power-decaying, which is slower than the exponentially decaying rate of $\hat{H}_z$. Asymptotic properties of $\hat{H}_{CS}$ have not been developed in the literature.
There are several parametric entropy estimators for specific settings, for example, the Bayesian estimator of entropy under a Dirichlet prior [30,31] and the shrinkage estimator of entropy [32]. This review article focuses on results from non-parametric estimation methods. To conclude this section, a small-scale comparison between $\hat{H}$ and $\hat{H}_z$ from [33] is provided in Table 1.
2.1.2. Mutual Information
In the same paper defining Shannon’s entropy, the concept of Mutual Information (MI) was also described [1]. Shannon’s entropies for X, Y, and (X, Y) are defined as
$$ H(X) = -\sum_{i} p_{i,\cdot} \ln p_{i,\cdot}, \quad H(Y) = -\sum_{j} p_{\cdot,j} \ln p_{\cdot,j}, \quad H(X, Y) = -\sum_{i,j} p_{i,j} \ln p_{i,j}, $$
and MI between X and Y is defined as
$$ MI(X, Y) = H(X) + H(Y) - H(X, Y). $$
Some notable properties of MI are:
Property 2 (Mutual Information).
1. MI is a measurement of dependence. It is always non-negative by definition.
2. $MI(X, Y) = 0$ if and only if the two marginals are independent.
3. $MI(X, Y) > 0$ if and only if the two marginals are dependent.
4. A non-zero MI does not always indicate the degree (level) of dependence.
5. MI may not exist when the cardinality of the joint space is countably infinite.
MI Estimation-The Plug-in Estimator and Z-Estimator
Since MI is a function of entropy, estimation of MI is essentially entropy estimation. Let $\{(X_1, Y_1), \dots, (X_n, Y_n)\}$ be an i.i.d. random sample of size n from the joint alphabet $\mathcal{X} \times \mathcal{Y}$. Based on the sample, plug-in estimators of the component entropies of MI can be obtained, namely $\hat{H}(X)$, $\hat{H}(Y)$, and $\hat{H}(X, Y)$, where $\hat{H}(X)$ is the plug-in estimator for $H(X)$, $\hat{H}(Y)$ is the plug-in estimator for $H(Y)$, and $\hat{H}(X, Y)$ is the plug-in estimator for $H(X, Y)$. Then the plug-in estimator of mutual information between X and Y is defined as
$$ \widehat{MI}(X, Y) = \hat{H}(X) + \hat{H}(Y) - \hat{H}(X, Y). $$
With various entropy estimation methods, one could estimate MI by replacing each component entropy estimator with a different entropy estimator. For example, using the entropy estimator with the fastest bias decaying rate, $\hat{H}_z$, the resulting estimator ($\widehat{MI}_z$) also has a bias that decays exponentially [34].
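A minimal base-R sketch of the plug-in MI computed from a two-way contingency table is given below; the simulated data are hypothetical, and any of the entropy estimators above could be substituted for the plug-in entropies.

```r
# Plug-in mutual information from a two-way contingency table of counts
H_plugin <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log(p))
}

MI_plugin <- function(tab) {
  H_plugin(rowSums(tab)) + H_plugin(colSums(tab)) - H_plugin(tab)
}

set.seed(2)
x <- sample(c("A", "B", "C"), 300, replace = TRUE)
y <- ifelse(runif(300) < 0.4, x, sample(c("A", "B", "C"), 300, replace = TRUE))  # y depends on x
tab <- table(x, y)
MI_plugin(tab)   # positive value reflecting dependence
```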
The asymptotic properties for the MI estimators ($\widehat{MI}$ and $\widehat{MI}_z$) shall be discussed under two situations: (1) $MI(X, Y) = 0$ and (2) $MI(X, Y) > 0$.
The first situation, $MI(X, Y) = 0$, is used for testing independence. For example, in feature selection, non-ordinal features irrelevant to the outcome shall be dropped, and a feature is irrelevant if it is independent of the outcome. Let A be the potentially irrelevant feature and B be the outcome; hence one must test $H_0: MI(A, B) = 0$ against $H_1: MI(A, B) > 0$. To test such a hypothesis, one needs the asymptotic properties of the MI estimators under the null hypothesis $MI(A, B) = 0$, derived in [35]. Namely,
Theorem 5 (Asymptotic properties of $\widehat{MI}$ and $\widehat{MI}_z$ when $MI(X, Y) = 0$). Provided that $K_1$ and $K_2$ are finite,
$$ 2n\,\widehat{MI} \stackrel{d}{\longrightarrow} \chi^2_{\nu} \quad \text{and} \quad 2n\,\widehat{MI}_z \stackrel{d}{\longrightarrow} \chi^2_{\nu}, $$
where n is the sample size and $\chi^2_{\nu}$ stands for the chi-squared distribution with degrees of freedom $\nu = (K_1 - 1)(K_2 - 1)$.
For the second situation, $MI(X, Y) > 0$ (recall that $MI(X, Y) > 0$ if and only if the two marginals are dependent), the following asymptotic properties are due to [34].
Let $\{\hat{p}_{i,j}\}$ be the enumeration of the plug-in estimators of the joint probabilities, and let $\sigma^2$ denote the asymptotic variance determined by the joint distribution $\mathbf{p}_{X,Y}$ (its explicit form is given in [34]). Then,

Theorem 6 (Asymptotic properties of $\widehat{MI}$ and $\widehat{MI}_z$ when $MI(X, Y) > 0$). Provided that $\sigma^2 > 0$,
$$ \sqrt{n}\,\big(\widehat{MI} - MI\big) \stackrel{d}{\longrightarrow} N\big(0, \sigma^2\big) \quad \text{and} \quad \sqrt{n}\,\big(\widehat{MI}_z - MI\big) \stackrel{d}{\longrightarrow} N\big(0, \sigma^2\big). $$
The following examples describe a proper use of MI and the properties in Theorems 5 and 6.
Example 1 (Genes TMEM30A and MTCH2—data and descriptions are in Example 1 of [34]). In this example, data were from two different genes in 191 patients, and the estimated mutual information was calculated in [34]. The hypothesis test in Example 1 of [35] gave a p-value of 0.0567, which suggests that $MI = 0$ cannot be rejected at $\alpha = 0.05$. However, one shall use the property in Theorem 6 to obtain a confidence interval for MI. One must not use the property in Theorem 5 for the purpose of a confidence interval in this situation (because the asymptotic distribution in Theorem 5 assumes a specific value of MI under the null hypothesis).
Example 2 (Genes ENAH and ENAH—data and descriptions are in Example 2 of [34]). In this example, data were from two different probes of the same gene on 191 patients, and the estimated mutual information was calculated in [34]. The hypothesis test in Example 2 of [35] gave a p-value of 0.0012, which suggests that $MI > 0$ at $\alpha = 0.05$. Furthermore, one shall use the property in Theorem 6 to obtain a confidence interval for MI. One must not use the property in Theorem 5 for the purpose of a confidence interval in this situation.
Example 3 (Comparing the MI between Examples 1 and 2). From Examples 1 and 2, two MI estimates are obtained. Although the second estimated value is higher, one cannot conclude that the level of dependence between the two ENAH probes is higher than that between TMEM30A and MTCH2, due to the limitation described in the fourth property of Property 2. To compare levels of dependence, one shall refer to the standardized mutual information in Section 3.1.
Recall that MI is always non-negative. For the same reason, $\widehat{MI}$ is always non-negative (note that $\widehat{MI}$ can be viewed as the MI of the empirical joint distribution $\hat{\mathbf{p}}_{X,Y}$). Nevertheless, $\widehat{MI}_z$ can be negative under some scenarios. A negative $\widehat{MI}_z$ suggests that the level of dependence between the two random elements is extremely weak. If one uses the results from Theorem 5 to test whether $MI = 0$, a negative $\widehat{MI}_z$ would lead to a fail-to-reject decision for most settings of $\alpha$ (the level of significance).
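For illustration, the following sketch implements an independence test in the spirit of Theorem 5 using the plug-in MI: with natural logarithms, $2n\widehat{MI}$ coincides with the classical likelihood-ratio (G) statistic, which is asymptotically chi-squared with $(K_1 - 1)(K_2 - 1)$ degrees of freedom under independence. The contingency table is hypothetical.

```r
# Independence test based on the plug-in MI: with natural logs, 2 * n * MI_hat equals
# the classical likelihood-ratio (G) statistic, asymptotically chi-squared with
# (K1 - 1) * (K2 - 1) degrees of freedom under independence.
H_plugin <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log(p))
}

mi_independence_test <- function(tab) {
  n <- sum(tab)
  mi_hat <- H_plugin(rowSums(tab)) + H_plugin(colSums(tab)) - H_plugin(tab)
  stat <- 2 * n * mi_hat
  df <- (nrow(tab) - 1) * (ncol(tab) - 1)
  p_value <- pchisq(stat, df = df, lower.tail = FALSE)
  list(statistic = stat, df = df, p.value = p_value)
}

tab <- matrix(c(30, 20, 10, 25, 35, 40), nrow = 2, byrow = TRUE)  # hypothetical 2 x 3 table
mi_independence_test(tab)
```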
Remarks
There is another line of research on multivariate information-theoretic methods, the Partial Information Decomposition (PID) framework [36,37,38]. The PID may be viewed as a direct extension of MI to a measure of the information provided by two or more variables about a third. Interesting applications of the PID include explaining representation learning in neural networks [39] and feature selection from dependent features [40]. PID aims to characterize redundancy with information decomposition. Another approach to characterizing redundancy is to utilize MI on a joint feature space [33]. Additional research comparing the two approaches is needed.
2.2. Kullback–Leibler Divergence
Kullback–Leibler divergence (KL) [2], also known as relative entropy, is a measure of the distance between two probability distributions and an important measure of information in information theory. The notation used to define KL and describe its properties differs slightly from other sections. Let $P = \{p_k\}$ and $Q = \{q_k\}$ be two discrete probability distributions on the same finite alphabet $\{\ell_k; k = 1, \dots, K\}$, where K is a finite integer. KL is defined to be
$$ KL(P \,\|\, Q) = \sum_{k=1}^{K} p_k \ln \frac{p_k}{q_k}. $$
Note that many also use D as the notation of KL, namely $D(P \,\|\, Q) = KL(P \,\|\, Q)$. KL is not a metric since it does not satisfy the triangle inequality and is not symmetric. Some notable properties of KL are:

Property 3 (Kullback–Leibler divergence).
1. KL is a measurement of non-metric distance between two distributions on the same alphabet (with the same discrete support). It is always non-negative because of Gibbs’ inequality.
2. $KL(P \,\|\, Q) = 0$ if and only if the two underlying distributions are the same, namely $p_k = q_k$ for each k.
3. $KL(P \,\|\, Q) > 0$ if and only if the two underlying distributions are different, namely $p_k \neq q_k$ for some k.
The use of KL has several variants, including but not limited to, (1) P and Q are unknown; (2) Q is known; (3) P and Q are continuous distributions. The second variant is an alternative method of the Pearson goodness-of-fit test. Interested readers may refer to [41] for more discussion on the second variant. Although utilizing entropic statistics on continuous spaces is generally not recommended, interested readers may refer to [42,43] for discussions on the third variant.
2.2.1. KL Point Estimation-The Plug-in Estimator, Augmented Estimator, and Z-Estimator
Although KL is not exactly a function of entropy, it still carries many similarities with entropy. For that reason, KL estimation is very similar to entropy estimation. For example, KL can be estimated from a plug-in perspective. Let $\hat{P} = \{\hat{p}_k\}$ be the plug-in estimator of P and $\hat{Q} = \{\hat{q}_k\}$ be the plug-in estimator of Q; then the KL plug-in estimator is
$$ \widehat{KL}(P \,\|\, Q) = \sum_{k=1}^{K} \hat{p}_k \ln \frac{\hat{p}_k}{\hat{q}_k}. $$
Because $\widehat{KL}$ could have an infinite bias [44], an augmented plug-in estimator of KL, $\widehat{KL}_{aug}$, was presented in [44], in which the plug-in estimate of Q is augmented to avoid zero denominators, with m being the sample size of the sample from Q. The bias of $\widehat{KL}_{aug}$ decays no faster than a power of n, where n is the sample size of the sample from P [44]. Since $\widehat{KL}$ could have an infinite bias, estimating KL in the perspective of $\hat{H}_{MM}$ or $\hat{H}_{JK}$ will not help in reducing the bias to a finite extent. In the perspective of $\hat{H}_z$, a KL estimator with an exponentially decaying bias, $\widehat{KL}_z$, was offered in [44].
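A base-R sketch of the plug-in KL estimator computed from two categorical samples on the same alphabet follows; the additive smoothing used to avoid division by zero is an ad hoc device of this sketch and is not the augmented estimator of [44].

```r
# Plug-in estimator of KL(P || Q) from two categorical samples on the same alphabet.
# A small additive smoothing constant avoids division by zero when a letter is
# unobserved in the sample from Q (an ad hoc choice for this sketch only).
KL_plugin <- function(x, y, smooth = 0.5) {
  lev <- union(unique(x), unique(y))
  p_hat <- table(factor(x, levels = lev))
  q_hat <- table(factor(y, levels = lev)) + smooth
  p_hat <- p_hat / sum(p_hat)
  q_hat <- q_hat / sum(q_hat)
  keep <- p_hat > 0
  sum(p_hat[keep] * log(p_hat[keep] / q_hat[keep]))
}

set.seed(3)
x <- sample(c("a", "b", "c"), 200, replace = TRUE, prob = c(0.5, 0.3, 0.2))  # sample from P
y <- sample(c("a", "b", "c"), 250, replace = TRUE, prob = c(0.3, 0.4, 0.3))  # sample from Q
KL_plugin(x, y)
```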
2.2.2. Symmetrized KL and Its Point Estimation
As mentioned in the first property of Property 3, KL is generally an asymmetric measurement. For certain purposes that require a symmetric measurement, a symmetrized KL is defined to be
$$ S = KL(P \,\|\, Q) + KL(Q \,\|\, P). $$
The symmetrized KL, S, as a function of $KL(P \,\|\, Q)$ and $KL(Q \,\|\, P)$, can be similarly estimated in the perspectives of $\widehat{KL}$, $\widehat{KL}_{aug}$, and $\widehat{KL}_z$. The respective estimators are denoted $\hat{S}$, $\hat{S}_{aug}$, and $\hat{S}_z$, where n is the sample size of the sample from P and m is the sample size of the sample from Q.

2.2.3. Asymptotic Properties for KL and Symmetrized KL Estimators
The asymptotic properties for $\widehat{KL}$, $\widehat{KL}_{aug}$, $\widehat{KL}_z$, $\hat{S}$, $\hat{S}_{aug}$, and $\hat{S}_z$ are all presented in [44]. All the asymptotic properties therein require $P \neq Q$ (namely, $KL(P \,\|\, Q) > 0$). When $P = Q$, the asymptotic properties of the KL (or S) estimators are currently missing from the literature. The derivation of such asymptotic properties is not complicated, yet it is unnecessary: the only purpose of such asymptotic properties under $P = Q$ is to test $H_0: P = Q$ against $H_1: P \neq Q$, and for that purpose, the two-sample goodness-of-fit chi-squared test can be used (see p. 616 in [45]).
3. Recently Developed Entropic Statistics Quantities and Estimation
In this section, various recently developed entropic statistics quantities are introduced and discussed. Some quantities are quite new, with few estimation results developed beyond plug-in estimation. Therefore, some of the following discussions focus on conceptual ideas and application potentials.
3.1. Standardized Mutual Information
Mutual information between two random elements (on non-ordinal alphabets) is similar to the covariance between two random variables (on ordinal spaces) regarding both properties and drawbacks. For example, the covariance does not provide general information on the degree of correlation, and the concept of the correlation coefficient was defined to fill the gap. Similarly, recalling the fourth property of MI (MI generally does not provide information about the degree of dependence), standardized mutual information (SMI), $\kappa$, has been studied and defined in various ways, typically by normalizing $MI(X, Y)$ by the joint entropy or by functions of the marginal entropies, provided these entropies are finite and positive.
One such normalized quantity is also called the information gain ratio [46]. The benefits of SMI are supported by Theorem 7.
Theorem 7 (Theorem 5.4 in [28]). Suppose $0 < H(X, Y) < \infty$. Then $0 \le \kappa \le 1$.
Moreover, (1) $\kappa = 0$ if and only if X and Y are independent, and (2) $\kappa = 1$ if and only if X and Y have a one-to-one correspondence.
Interested readers may refer to [47,48,49,50] for discussions on SMI. A detailed discussion of the estimation of various SMI may be found in [51].
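As an illustration, the sketch below computes one commonly used SMI variant, MI normalized by the joint entropy, which lies in [0, 1]; the exact variants referenced above should be taken from the cited sources, and the contingency table is hypothetical.

```r
# One common standardized mutual information (SMI) variant:
# MI(X, Y) normalized by the joint entropy H(X, Y), which lies in [0, 1].
H_plugin <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log(p))
}

smi_joint <- function(tab) {
  mi <- H_plugin(rowSums(tab)) + H_plugin(colSums(tab)) - H_plugin(tab)
  mi / H_plugin(tab)
}

tab <- table(rep(c("x1", "x2"), each = 50),
             c(rep(c("y1", "y2"), c(40, 10)), rep(c("y1", "y2"), c(15, 35))))  # hypothetical
smi_joint(tab)   # between 0 (independence) and 1 (one-to-one correspondence)
```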
3.2. Entropic Basis: A Generalization from Shannon’s Entropy
Shannon’s entropy and MI are powerful tools to quantify dispersion and dependence on non-ordinal space. More concepts and statistical tools are needed to characterize non-ordinal space information from different perspectives.
Generalized Simpson’s diversity indices were established in [52] and coined in [3].
Definition 1 (Generalized Simpson’s Diversity Indices). For a given $\mathbf{p} = \{p_k\}$ and a pair of non-negative integers $(u, v)$, let
$$ \zeta_{u,v} = \sum_{k \ge 1} p_k^u (1 - p_k)^v $$
be defined as the family of generalized Simpson’s diversity indices.
Generalized Simpson’s diversity indices are the foundation of the entropic basis and entropic moments. Interested readers may refer to [53] for discussions on entropic moments and a goodness-of-fit test under permutation based on entropic moments. For estimating $\zeta_{u,v}$, an estimator $Z_{u,v}$ was derived in [52], where n is the sample size, u and v are given constants, $\hat{p}_k$ is the sample proportion of the k-th letter (category), and $\mathbb{1}[\cdot]$ stands for the indicator function. $Z_{u,v}$ is a uniformly minimum-variance unbiased estimator (UMVUE) of $\zeta_{u,v}$ for any pair of non-negative integers $(u, v)$ as long as $u + v \le n$, where n is the corresponding sample size. Based on $\zeta_{u,v}$, ref. [3] defined the entropic basis.
Definition 2 (Entropic Basis). Given Definition 1, the entropic basis is the sub-family $\{\zeta_{1,v}: v = 1, 2, \dots\}$ of $\zeta$.
All diversity indices can be represented as functions of the entropic basis [3] (most representations are due to Taylor’s expansion). For example,
Simpson’s index [54]: $\sum_{k} p_k^2 = 1 - \zeta_{1,1}$;
Gini–Simpson index [54,55]: $1 - \sum_{k} p_k^2 = \zeta_{1,1}$;
Shannon’s entropy: $H = \sum_{v \ge 1} \frac{1}{v}\,\zeta_{1,v}$;
Rényi equiv. entropy [56]:
Emlen’s index [57]:
Richness index (population size): $K = 1 + \sum_{v \ge 1} \zeta_{1,v}$;
Generalized Simpson’s index:
In practice, plug-in estimation is used in estimating diversity indices. The representations of the diversity indices on the entropic basis allow a new estimation method with a smaller bias. Namely, $Z_{1,v}$, the UMVUE for $\zeta_{1,v}$, exists for all v up to $n-1$. If one replaces each $\zeta_{1,v}$ with its plug-in counterpart and lets all the remaining terms be zero, then the resulting estimator is exactly the same as the plug-in estimator of the index. However, the estimation can be further improved if one estimates the index based on $Z_{1,v}$.
For example, let $\hat{K}$ (the observed number of categories) be the plug-in estimator of K. The estimator constructed in the perspective of the entropic basis representation is algebraically equivalent to $\hat{K}$ [58]. Namely, since $K = 1 + \sum_{v \ge 1} \zeta_{1,v}$ is a decomposition of K, an estimator built on improved estimates of the basis terms can have a smaller bias than $\hat{K}$. Interested readers may refer to [58] for details on the estimation of K. Similar estimation could benefit the estimation of the Rényi equivalent entropy, Emlen’s index, and any other diversity indices or theoretical quantities that contain such terms after Taylor’s expansion.
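To make the entropic-basis representations concrete, the following sketch numerically checks the representations of Shannon’s entropy and of the richness index for a toy distribution, assuming the basis members take the form $\zeta_{1,v} = \sum_k p_k (1 - p_k)^v$ and truncating the infinite sums at a finite v.

```r
# Entropic basis members zeta_{1,v} = sum_k p_k * (1 - p_k)^v for a probability vector p
zeta_1v <- function(p, v) sum(p * (1 - p)^v)

p <- c(0.4, 0.3, 0.2, 0.1)        # hypothetical distribution with K = 4
V <- 5000                         # truncation point for the infinite sums

# Shannon's entropy: H = sum_{v >= 1} zeta_{1,v} / v
H_basis  <- sum(sapply(1:V, function(v) zeta_1v(p, v) / v))
H_direct <- -sum(p * log(p))
c(H_basis = H_basis, H_direct = H_direct)      # nearly identical

# Richness (population size): K = 1 + sum_{v >= 1} zeta_{1,v}
K_basis <- 1 + sum(sapply(1:V, function(v) zeta_1v(p, v)))
K_basis                                         # approaches 4
```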
3.3. Generalized Shannon’s Entropy and Generalized Mutual Information
Because of the advantages in characterizing information in non-ordinal space, Shannon’s entropy and MI have become the building blocks of information theory and essential aspects of ML methods. Yet, they are only finitely defined for distributions with fast decaying tails on a countable alphabet.
Example 4 (Unbounded entropy). Let X be a random variable following the distribution $\mathbf{p} = \{p_k\}$ with $p_k = c/(k \ln^2 k)$ for $k = 2, 3, \dots$, where c is the constant that makes the distribution valid (total probability adding up to 1). Such a constant uniquely exists because $\sum_{k \ge 2} 1/(k \ln^2 k)$ converges. Then
$$ H = -\sum_{k \ge 2} p_k \ln p_k = \infty $$
because $-\ln p_k = \ln k + 2\ln\ln k - \ln c$ and $\sum_{k \ge 2} \frac{\ln k}{k \ln^2 k} = \sum_{k \ge 2} \frac{1}{k \ln k}$ diverges. Therefore, the entropy is unbounded.
The unboundedness of Shannon’s entropy and MI over the general class of all distributions on an alphabet prevents their potential utility from being fully realized. Ref. [5] proposed GSE and GMI, which are finitely defined everywhere. To state the definition of GSE and GMI, Definition 3 is stated first.
Definition 3 (Conditional Distribution of Total Collision (CDOTC)). Given $\mathcal{X}$ and $\mathbf{p} = \{p_k\}$, consider the experiment of drawing an identically and independently distributed (i.i.d.) sample of size m ($m \ge 1$). Let $E_m$ denote the event that all observations of the sample take on the same letter in $\mathcal{X}$, and let $E_m$ be referred to as the event of a total collision. The conditional probability, given $E_m$, that the total collision occurs at the letter $\ell_k$ is
$$ p_{m,k} = \frac{p_k^m}{\sum_{j \ge 1} p_j^m}, $$
where $k \ge 1$. $\mathbf{p}_m = \{p_{m,k}\}$ is defined as the m-th order CDOTC. The idea of CDOTC is to adopt a special member of the family of escort distributions introduced in [59]. The utility of CDOTC is endorsed by Lemmas 1 and 2, which are proved in [5].
For any order m, $\mathbf{p}$ and the m-th order CDOTC $\mathbf{p}_m$ uniquely determine each other.
For any order m, if and only if .
It is clear that $\mathbf{p}_m$ is a probability distribution induced from $\mathbf{p}$. An example is provided to help understand Definition 3.
Example 5 (The second-order CDOTC). Let $\mathbf{p} = \{p_k\}$; the second-order CDOTC is then defined as $\mathbf{p}_2 = \{p_{2,k}\}$ with
$$ p_{2,k} = \frac{p_k^2}{\sum_{j \ge 1} p_j^2} $$
for $k \ge 1$. Based on Definition 3, GSE and GMI are defined as follows.
Definition 4 (Generalized Shannon’s Entropy (GSE)). Given $\mathcal{X}$, $\mathbf{p}$, and an integer order m, generalized Shannon’s entropy (GSE) is defined as
$$ H_m = -\sum_{k \ge 1} p_{m,k} \ln p_{m,k}, $$
where $\mathbf{p}_m = \{p_{m,k}\}$ is defined in Definition 3 and m is the order of GSE. GSE with order m is called the m-th order GSE.

Definition 5 (Generalized Mutual Information (GMI)). Let X and Y be random elements on $\mathcal{X} \times \mathcal{Y}$ with joint distribution $\mathbf{p}_{X,Y}$. Let $\mathbf{p}_{XY,m}$ be the m-th order CDOTC of $\mathbf{p}_{X,Y}$, and let $\mathbf{p}_{X,m}$ and $\mathbf{p}_{Y,m}$ be the marginal distributions of $\mathbf{p}_{XY,m}$. The m-th order generalized mutual information (GMI) between X and Y is defined as
$$ MI_m(X, Y) = \sum_{i,j} p_{XY,m}(i,j) \ln \frac{p_{XY,m}(i,j)}{p_{X,m}(i)\, p_{Y,m}(j)}, $$
or equivalently
$$ MI_m(X, Y) = H\big(\mathbf{p}_{X,m}\big) + H\big(\mathbf{p}_{Y,m}\big) - H\big(\mathbf{p}_{XY,m}\big). $$
To help understand Definitions 4 and 5, Examples 6 and 7 are provided as follows.
Example 6 (The second-order GSE). Let $\mathcal{X}$ and $\mathbf{p} = \{p_k\}$ be given. The second-order GSE, $H_2$, is then defined as
$$ H_2 = -\sum_{k \ge 1} p_{2,k} \ln p_{2,k}, $$
where $\mathbf{p}_2 = \{p_{2,k}\}$ is given in Example 5.

Example 7 (The second-order GMI). Let $\mathcal{X} \times \mathcal{Y}$, $\mathbf{p}_{X,Y} = \{p_{i,j}\}$, and $m = 2$ be given. Let
$$ p_{2,(i,j)} = \frac{p_{i,j}^2}{\sum_{s,t} p_{s,t}^2} \quad \text{and} \quad \mathbf{p}_{XY,2} = \{p_{2,(i,j)}\}. $$
Further, let
$$ p_{2,(i,\cdot)} = \sum_{j} p_{2,(i,j)} \quad \text{and} \quad p_{2,(\cdot,j)} = \sum_{i} p_{2,(i,j)}. $$
The second-order GMI, $MI_2(X, Y)$, is then defined as
$$ MI_2(X, Y) = H_{X,2} + H_{Y,2} - H_{XY,2}, $$
where $H_{X,2}$, $H_{Y,2}$, and $H_{XY,2}$ are Shannon’s entropies based on $\{p_{2,(i,\cdot)}\}$, $\{p_{2,(\cdot,j)}\}$, and $\{p_{2,(i,j)}\}$, respectively.
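Following the reading of Definitions 3–5 and Examples 5–7 above, a base-R sketch of the second-order CDOTC, GSE, and GMI computed from a given joint probability matrix is shown below; the joint distribution is hypothetical.

```r
# Shannon's entropy (nats) of a probability vector or matrix
H <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }

# m-th order CDOTC of a probability distribution (vector or matrix), Definition 3
cdotc <- function(p, m = 2) p^m / sum(p^m)

# m-th order GSE, Definition 4: Shannon's entropy of the m-th order CDOTC
gse <- function(p, m = 2) H(cdotc(p, m))

# m-th order GMI, Definition 5 / Example 7: entropies of the marginals of the joint
# CDOTC minus the entropy of the joint CDOTC
gmi <- function(p_joint, m = 2) {
  p_m <- cdotc(p_joint, m)
  H(rowSums(p_m)) + H(colSums(p_m)) - H(p_m)
}

# A hypothetical joint distribution on a 3 x 2 alphabet
p_joint <- matrix(c(0.20, 0.10,
                    0.15, 0.25,
                    0.05, 0.25), nrow = 3, byrow = TRUE)

gse(rowSums(p_joint), m = 2)   # second-order GSE of the X marginal
gmi(p_joint, m = 2)            # second-order GMI between X and Y
```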
GSE’s and GMI’s plug-in estimators are stated in Definitions 6 and 7.
Definition 6 (GSE’s plug-in estimator). Let $X_1, \dots, X_n$ be i.i.d. random variables taking values in $\mathcal{X}$ with distribution $\mathbf{p}$. The plug-in estimator for $\mathbf{p}$ is $\hat{\mathbf{p}} = \{\hat{p}_k\}$. The plug-in estimator for the m-th order GSE, $H_m$, is
$$ \hat{H}_m = -\sum_{k \ge 1} \hat{p}_{m,k} \ln \hat{p}_{m,k}, \quad \text{where } \hat{p}_{m,k} = \frac{\hat{p}_k^m}{\sum_{j \ge 1} \hat{p}_j^m}. $$
Definition 7 (GMI’s plug-in estimator). Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be i.i.d. random pairs taking values in $\mathcal{X} \times \mathcal{Y}$ with distribution $\mathbf{p}_{X,Y}$. Let $\hat{\mathbf{p}}_{X,Y} = \{\hat{p}_{i,j}\}$ be the plug-in estimator of $\mathbf{p}_{X,Y}$. The plug-in estimator for the m-th order GMI, $MI_m(X, Y)$, is
$$ \widehat{MI}_m(X, Y) = \hat{H}_{X,m} + \hat{H}_{Y,m} - \hat{H}_{XY,m}, $$
where $\hat{H}_{X,m}$, $\hat{H}_{Y,m}$, and $\hat{H}_{XY,m}$ are the plug-in Shannon entropies of the two marginals and of the joint of the m-th order CDOTC of $\hat{\mathbf{p}}_{X,Y}$. The following asymptotic properties for GSE’s plug-in estimators are given in [60].
Theorem 8. Let $\mathbf{p}$ be a probability distribution on a countably infinite alphabet $\mathcal{X}$. Without any further conditions,
$$ \sqrt{n}\,\big(\hat{H}_m - H_m\big) \stackrel{d}{\longrightarrow} N\big(0, \sigma_m^2\big), $$
where the asymptotic variance $\sigma_m^2$ is given explicitly in [60].
Theorem 9. Let $\mathbf{p}$ be a non-uniform probability distribution on a finite alphabet $\mathcal{X}$. Without any further conditions,
$$ \sqrt{n}\,\big(\hat{H}_m - H_m\big) \stackrel{d}{\longrightarrow} N\big(0, \sigma_m^2\big), $$
where $\sigma_m^2$ is the corresponding asymptotic variance given in [60].
The properties in Theorems 8 and 9 allow interval estimation and hypothesis testing with $\hat{H}_m$. The advantage of shifting the original distribution to an escort distribution is reflected in Theorem 8: the asymptotic normality requires no assumption on a countably infinite alphabet. Theorem 9 can be viewed as a special case of Theorem 8 under a finite situation, where the uniform distribution shall be excluded because a uniform distribution has no variation between different category probabilities and hence results in a zero asymptotic variance and a degenerate asymptotic distribution.
Nevertheless, suppose one is certain that the cardinality of the distribution is finite. In that case, one shall use Shannon’s entropy instead of GSE because Shannon’s entropy always exists under a finite distribution, and there are various well-studied estimation methods for Shannon’s entropy (whereas only plug-in estimation of GSE has been studied so far).
Asymptotic properties for the GMI plug-in estimator have not been studied yet. Nonetheless, a test of independence with a modified GMI [61] has been studied. The test does not require knowledge of the number of columns or rows of a contingency table; hence it provides an alternative to Pearson’s chi-squared test of independence, particularly when a contingency table is large or sparse.
4. Application of Entropic Statistics in Machine Learning and Knowledge Extraction
Applications of entropic statistics in ML and knowledge extraction can be clustered in two directions. The first direction is to solve an existing question from a new perspective by creating a new information-theoretic quantity [61] or revisiting an existing information-theoretic quantity for additional insights [62]. The second direction is to use different estimation methods within existing methods to improve performance by reducing bias and/or variance [32]. Application potentials in the second direction are very promising because theoretical results from recently developed estimation methods suggest that the performance of many existing ML methods could be improved, yet not much research has been conducted in this direction. In this section, several established ML and knowledge extraction methods are discussed with respect to their potential for improvement in the second direction.
4.1. An Entropy-Based Random Forest Model
Ref. [63] proposed an entropy-importance-based random forest model for power quality feature selection and disturbance classification. The method used a greedy search based on entropy and information gain for node segmentation. Nevertheless, only the plug-in estimation of entropy and information gain was considered. The method could be improved by replacing the plug-in estimation with smaller-bias estimation methods, such as $\hat{H}_z$ in [25]. Further, one can also combine $\hat{H}_z$ with the jackknife procedure in (3) to obtain
$$ \hat{H}_z^{JK} = n\hat{H}_z - \frac{n-1}{n}\sum_{i=1}^{n} \hat{H}_{z,-i}, $$
where $\hat{H}_{z,-i}$ is $\hat{H}_z$ computed on the sub-sample of size $n-1$ with the i-th observation left out, and use $\hat{H}_z^{JK}$ in place of the adopted plug-in estimation. The benefit of using $\hat{H}_z^{JK}$ is the potentially smaller bias and variance [25]. However, asymptotic properties for $\hat{H}_z^{JK}$ are yet to be developed. When asymptotic properties are desired (e.g., for confidence interval or hypothesis testing purposes), one shall consider estimators with established asymptotic properties (also called a theoretical guarantee), such as $\hat{H}$, $\hat{H}_z$, $\widehat{MI}$, and $\widehat{MI}_z$.
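A generic sketch of the jackknife combination in (3) applied to an arbitrary entropy estimator is given below; the plug-in estimator is used for demonstration, and a smaller-bias estimator (for example, a wrapper around Entropy.z from the EntropyEstimation package listed in Appendix A, after verifying its expected input format) could be passed in instead.

```r
# Jackknife combination of an arbitrary entropy estimator, following Equation (3):
# n * est(x) - ((n - 1)/n) * sum of the leave-one-out estimates.
jackknife_entropy <- function(x, estimator) {
  n <- length(x)
  est_full <- estimator(x)
  est_loo <- vapply(seq_len(n), function(i) estimator(x[-i]), numeric(1))
  n * est_full - (n - 1) / n * sum(est_loo)
}

# Demonstration with the plug-in estimator; a smaller-bias estimator could be
# substituted here after checking the input format it expects.
H_plugin_sample <- function(x) {
  p <- table(x) / length(x)
  -sum(p * log(p))
}

set.seed(4)
x <- sample(letters[1:8], 150, replace = TRUE)   # hypothetical categorical sample
jackknife_entropy(x, H_plugin_sample)
```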
4.2. Feature Selection Methods
In [16,64], various information-theoretic feature selection methods were reviewed and discussed. The two review articles did not mention that all the discussed methods adopt plug-in estimators for the corresponding information-theoretic quantities. Improving the performance with different estimation methods is possible, and investigation is needed. For example, some of the discussed methods are summarized in Table 2 with suggestions to utilize smaller-bias and/or smaller-variance estimation methods.
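As one example of the methods in Table 2, a minimal sketch of MIM-style filtering (scoring each categorical feature by its estimated MI with the outcome and keeping the top-scoring ones) follows; the simulated features and the use of the plug-in MI are assumptions of this sketch, and a smaller-bias MI estimator could be swapped in.

```r
# MIM-style filter: rank categorical features by their estimated mutual information
# with the outcome and keep the top ones. The plug-in MI is used here.
H_plugin <- function(counts) { p <- counts / sum(counts); p <- p[p > 0]; -sum(p * log(p)) }
mi_plugin_xy <- function(x, y) {
  tab <- table(x, y)
  H_plugin(rowSums(tab)) + H_plugin(colSums(tab)) - H_plugin(tab)
}

mim_rank <- function(features, outcome, top = 2) {
  scores <- vapply(features, function(f) mi_plugin_xy(f, outcome), numeric(1))
  sort(scores, decreasing = TRUE)[seq_len(min(top, length(scores)))]
}

set.seed(5)
outcome <- sample(c("case", "control"), 200, replace = TRUE)
features <- list(
  f1 = ifelse(runif(200) < 0.7, outcome, sample(c("case", "control"), 200, replace = TRUE)),
  f2 = sample(c("low", "high"), 200, replace = TRUE),     # irrelevant feature
  f3 = sample(c("g1", "g2", "g3"), 200, replace = TRUE)   # irrelevant feature
)
mim_rank(features, outcome, top = 2)   # f1 should score highest
```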
4.3. A Keyword Extraction Method
Ref. [79] proposed a keyword extraction method with Rényi’s entropy
$$ H_\alpha = \frac{1}{1-\alpha} \ln\left(\sum_{k \ge 1} p_k^\alpha\right), \quad \alpha > 0, \ \alpha \neq 1, $$
and used the plug-in estimator therein. Nevertheless, $H_\alpha$ can be represented as a function of the entropic basis $\{\zeta_{1,v}\}$, where each $\zeta_{1,v}$ has the UMVUE $Z_{1,v}$ for v up to $n-1$. The remaining terms could be estimated based on regression analysis [58]. Hence, $H_\alpha$ can be estimated through its entropic basis representation, where the construction of the tail estimates needs investigation using regression analysis [58]. The resulting estimator would have a smaller bias than that of the plug-in estimator, which could help improve the established keyword extraction method. Note that if one wishes to use the basis terms up to $v = n-1$ only, the resulting estimator becomes the estimator labeled (6), which is the same as the one in [3]; asymptotic properties for it were provided therein (Corollary 3 in [3]) for interested readers.
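For reference, the plug-in Rényi entropy that serves as the starting point of the keyword extraction method can be sketched as follows; the word counts and α values are hypothetical, and the smaller-bias route via the entropic basis described above is not implemented here.

```r
# Plug-in Renyi entropy of order alpha (alpha > 0, alpha != 1) from category counts
renyi_plugin <- function(counts, alpha) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  log(sum(p^alpha)) / (1 - alpha)
}

word_counts <- c(the = 120, model = 35, entropy = 28, keyword = 15, of = 90)  # hypothetical
renyi_plugin(word_counts, alpha = 2)
renyi_plugin(word_counts, alpha = 0.5)
```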
5. Conclusions
Entropic statistics is effective in characterizing information from non-ordinal space. Meanwhile, it is essential to realize that non-ordinal information is inherently difficult to identify due to its non-ordinal and permutation-invariant nature. This survey article aims to provide a comprehensive review of recent advances in entropic statistics, including the estimation of classic entropic concepts, recently developed entropic statistics quantities, and their application potentials in ML and knowledge extraction. The article first introduces the concept of entropic statistics and emphasizes the challenges posed by non-ordinal data. It then reviews the estimation of classic entropic quantities. These classic concepts, including Shannon’s entropy, MI, and KL, are widely used in established machine learning and knowledge extraction methods. Most, if not all, of the established methods use plug-in estimation, which is computationally efficient yet carries a relatively large bias. The surveyed estimation methods could help researchers improve the performance of existing methods by adopting a different estimation method or by adding theoretical guarantees to existing methods. Recently developed entropic statistics concepts are also reviewed together with their estimation and applications. These new concepts not only allow researchers to estimate existing quantities from a new perspective but also support additional aspects of characterizing non-ordinal information. In particular, the generalized Simpson’s diversity indices (with the induced entropic basis and entropic moments) have significant application and theoretical potential, either to customize existing ML and knowledge extraction methods or to establish new methods addressing domain-specific challenges. Further, this article provides examples of how to apply the surveyed results to some of the existing methods, including a random forest model, fourteen feature selection methods, and a keyword extraction model.
It should be mentioned that the aim of the survey is not to claim the superiority of some estimation methods over others but to provide a comprehensive list of recent advances in entropic statistics research. Specifically, although an estimator with a faster-decaying bias seems theoretically preferred, it requires a longer calculation time even with the convenient R functions, particularly when multiple layers of jackknife (bootstrap) are involved. The preferred estimator varies case by case: some may prefer an estimator with a smaller bias, some may prefer one with a smaller variance, while others may need a trade-off between them. Furthermore, the article focuses on non-parametric estimation, while parametric estimation would perform better if the specified model fits the domain-specific reality. In summary, one should always investigate whether a new estimation method fits the needs.
A large amount of additional work is still needed in entropic statistics. For example, (1) the asymptotic properties of several established estimators are not clear when the cardinality is infinite. (2) With the transition from the original distribution to an escort distribution, GSE and GMI fill the void left by Shannon’s entropy and MI; however, only plug-in estimation of GSE and GMI has been studied, the biases of these plug-in estimators have not been studied, and additional estimation methods are undoubtedly needed. (3) Calculations for many entropic statistics are not yet supported in R, such as the entropic basis, GSE, and GMI. Furthermore, more work is needed to implement the new entropic statistics concepts in programming software other than R (some of the reviewed estimators are implemented in R and are listed in Appendix A as a reference), particularly in Python. With additional theoretical development and application support, entropic statistics methods would become a more efficient tool to characterize non-ordinal information and better serve the demands arising from emerging domain-specific challenges.
The authors declare no conflict of interest.
The following abbreviations are used in this manuscript:
| ANOVA | Analysis of Variance |
| GSE | Generalized Shannon’s Entropy |
| GMI | Generalized Mutual Information |
| i.i.d. | independent and identically distributed |
| KL | Kullback–Leibler Divergence |
| MI | Mutual Information |
| ML | Machine Learning |
| PID | Partial Information Decomposition |
| SMI | Standardized Mutual Information |
| UMVUE | Uniformly Minimum-Variance Unbiased Estimator |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Table 1. Estimation comparison between $\hat{H}$ and $\hat{H}_z$ (from [33]).
| n | 100 | 300 | 500 | 1000 | 1500 | 2000 |
|---|---|---|---|---|---|---|
| avg. of $\hat{H}$ | 4.56 | 5.57 | 6.00 | 6.51 | 6.75 | 6.89 |
| avg. of $\hat{H}_z$ | 5.11 | 6.09 | 6.49 | 6.92 | 7.11 | 7.21 |
Table 2. Selected information-theoretic feature selection methods reviewed in [16,64]: MIM, IF, FCBF, AMIFS, CMIM, MRMR, ICAP, CIFE, DISR, IGFS, SOA, and CMIFS. For each method, the original table lists the proposed criterion (score) and a suggested different estimation method, namely replacing the plug-in estimators of the entropies and mutual information appearing in the criterion with smaller-bias and/or smaller-variance estimators.
Appendix A. R Functions
| Statistic | R Package Name | Function Name |
|---|---|---|
| Plug-in entropy estimator $\hat{H}$ | entropy [80] | entropy.plugin |
| Miller–Madow estimator $\hat{H}_{MM}$ | entropy | entropy.MillerMadow |
| Jackknife procedure (for $\hat{H}_{JK}$) | bootstrap [81] | jackknife |
| Chao–Shen estimator $\hat{H}_{CS}$ | entropy | entropy.ChaoShen |
| Entropy Z-estimator $\hat{H}_z$ | EntropyEstimation [82] | Entropy.z |
| (standard deviation for $\hat{H}_z$) | EntropyEstimation | Entropy.sd |
| Plug-in MI estimator $\widehat{MI}$ | entropy | mi.plugin |
| MI Z-estimator $\widehat{MI}_z$ | EntropyEstimation | MI.z |
| (standard deviation for $\widehat{MI}_z$) | EntropyEstimation | MI.sd |
| Plug-in KL estimator | entropy | KL.plugin |
| KL Z-estimator | EntropyEstimation | KL.z |
| Plug-in symmetrized KL estimator | EntropyEstimation | SymKL.plugin |
| Symmetrized KL Z-estimator | EntropyEstimation | SymKL.z |
| Rényi entropy Z-estimator | EntropyEstimation | Renyi.z |
References
1. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J.; 1948; 27, pp. 379-423. [DOI: https://dx.doi.org/10.1002/j.1538-7305.1948.tb01338.x]
2. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat.; 1951; 22, pp. 79-86. [DOI: https://dx.doi.org/10.1214/aoms/1177729694]
3. Zhang, Z.; Grabchak, M. Entropic representation and estimation of diversity indices. J. Nonparametr. Stat.; 2016; 28, pp. 563-575. [DOI: https://dx.doi.org/10.1080/10485252.2016.1190357]
4. Grabchak, M.; Zhang, Z. Asymptotic normality for plug-in estimators of diversity indices on countable alphabets. J. Nonparametr. Stat.; 2018; 30, pp. 774-795. [DOI: https://dx.doi.org/10.1080/10485252.2018.1482294]
5. Zhang, Z. Generalized Mutual Information. Stats; 2020; 3, pp. 158-165. [DOI: https://dx.doi.org/10.3390/stats3020013]
6. Burnham, K.P.; Anderson, D.R. Practical use of the information-theoretic approach. Model Selection and Inference; Springer: Berlin/Heidelberg, Germany, 1998; pp. 75-117.
7. Dembo, A.; Cover, T.M.; Thomas, J.A. Information theoretic inequalities. IEEE Trans. Inf. Theory; 1991; 37, pp. 1501-1518. [DOI: https://dx.doi.org/10.1109/18.104312]
8. Chatterjee, S.; Hadi, A.S. Regression Analysis by Example; John Wiley & Sons: Hoboken, NJ, USA, 2006.
9. Speed, T. What is an analysis of variance?. Ann. Stat.; 1987; 15, pp. 885-910. [DOI: https://dx.doi.org/10.1214/aos/1176350472]
10. Hardy, M.A. Regression with Dummy Variables; Sage: Newcastle upon Tyne, UK, 1993; Volume 93.
11. Kent, J.T. Information gain and a general measure of correlation. Biometrika; 1983; 70, pp. 163-173. [DOI: https://dx.doi.org/10.1093/biomet/70.1.163]
12. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory; 1991; 37, pp. 145-151. [DOI: https://dx.doi.org/10.1109/18.61115]
13. Van Erven, T.; Harremos, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory; 2014; 60, pp. 3797-3820. [DOI: https://dx.doi.org/10.1109/TIT.2014.2320500]
14. Sethi, I.K.; Sarvarayudu, G. Hierarchical classifier design using mutual information. IEEE Trans. Pattern Anal. Mach. Intell.; 1982; 4, pp. 441-445. [DOI: https://dx.doi.org/10.1109/TPAMI.1982.4767278] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/21869061]
15. Safavian, S.R.; Landgrebe, D. A survey of decision tree classifier methodology. IEEE Trans. Syst. Man Cybern.; 1991; 21, pp. 660-674. [DOI: https://dx.doi.org/10.1109/21.97458]
16. Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R.P.; Tang, J.; Liu, H. Feature selection: A data perspective. ACM Comput. Surv. (CSUR); 2017; 50, pp. 1-45. [DOI: https://dx.doi.org/10.1145/2996357]
17. Basharin, G.P. On a statistical estimate for the entropy of a sequence of independent random variables. Theory Probab. Appl.; 1959; 4, pp. 333-336. [DOI: https://dx.doi.org/10.1137/1104033]
18. Harris, B. The Statistical Estimation of Entropy in the Non-Parametric Case; Technical Report Wisconsin Univ-Madison Mathematics Research Center: Madison, WI, USA, 1975.
19. Zhang, Z.; Zhang, X. A normal law for the plug-in estimator of entropy. IEEE Trans. Inf. Theory; 2012; 58, pp. 2745-2747. [DOI: https://dx.doi.org/10.1109/TIT.2011.2179702]
20. Miller, G.A.; Madow, W.G. On the Maximum Likelihood Estimate of the Shannon-Weiner Measure of Information; Operational Applications Laboratory, Air Force Cambridge Research Center, Air Research and Development Command, Bolling Air Force Base: Washington, DC, USA, 1954.
21. Zahl, S. Jackknifing an index of diversity. Ecology; 1977; 58, pp. 907-913. [DOI: https://dx.doi.org/10.2307/1936227]
22. Chen, C.; Grabchak, M.; Stewart, A.; Zhang, J.; Zhang, Z. Normal Laws for Two Entropy Estimators on Infinite Alphabets. Entropy; 2018; 20, 371. [DOI: https://dx.doi.org/10.3390/e20050371]
23. Antos, A.; Kontoyiannis, I. Convergence properties of functional estimates for discrete distributions. Random Struct. Algorithms; 2001; 19, pp. 163-193. [DOI: https://dx.doi.org/10.1002/rsa.10019]
24. Paninski, L. Estimation of entropy and mutual information. Neural Comput.; 2003; 15, pp. 1191-1253. [DOI: https://dx.doi.org/10.1162/089976603321780272]
25. Zhang, Z. Entropy estimation in Turing’s perspective. Neural Comput.; 2012; 24, pp. 1368-1389. [DOI: https://dx.doi.org/10.1162/NECO_a_00266]
26. Schürmann, T. A note on entropy estimation. Neural Comput.; 2015; 27, pp. 2097-2106. [DOI: https://dx.doi.org/10.1162/NECO_a_00775] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26313604]
27. Zhang, Z. Asymptotic normality of an entropy estimator with exponentially decaying bias. IEEE Trans. Inf. Theory; 2013; 59, pp. 504-508. [DOI: https://dx.doi.org/10.1109/TIT.2012.2217393]
28. Zhang, Z. Statistical Implications of Turing’s Formula; John Wiley & Sons: Hoboken, NJ, USA, 2016.
29. Chao, A.; Shen, T.J. Nonparametric estimation of Shannon’s index of diversity when there are unseen species in sample. Environ. Ecol. Stat.; 2003; 10, pp. 429-443. [DOI: https://dx.doi.org/10.1023/A:1026096204727]
30. Nemenman, I.; Shafee, F.; Bialek, W. Entropy and inference, revisited. arXiv; 2001; arXiv: physics/0108025
31. Agresti, A.; Hitchcock, D.B. Bayesian inference for categorical data analysis. Stat. Methods Appl.; 2005; 14, pp. 297-330. [DOI: https://dx.doi.org/10.1007/s10260-005-0121-y]
32. Hausser, J.; Strimmer, K. Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks. J. Mach. Learn. Res.; 2009; 10, pp. 1469-1484.
33. Shi, J.; Zhang, J.; Ge, Y. CASMI—An Entropic Feature Selection Method in Turing’s Perspective. Entropy; 2019; 21, 1179. [DOI: https://dx.doi.org/10.3390/e21121179]
34. Zhang, Z.; Zheng, L. A mutual information estimator with exponentially decaying bias. Stat. Appl. Genet. Mol. Biol.; 2015; 14, pp. 243-252. [DOI: https://dx.doi.org/10.1515/sagmb-2014-0047]
35. Zhang, J.; Chen, C. On “A mutual information estimator with exponentially decaying bias” by Zhang and Zheng. Stat. Appl. Genet. Mol. Biol.; 2018; 17, 20180005. [DOI: https://dx.doi.org/10.1515/sagmb-2018-0005]
36. Williams, P.L.; Beer, R.D. Nonnegative decomposition of multivariate information. arXiv; 2010; arXiv: 1004.2515
37. Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J.; Ay, N. Quantifying unique information. Entropy; 2014; 16, pp. 2161-2183. [DOI: https://dx.doi.org/10.3390/e16042161]
38. Griffith, V.; Koch, C. Quantifying synergistic mutual information. Guided Self-Organization: Inception; Springer: Berlin/Heidelberg, Germany, 2014; pp. 159-190.
39. Tax, T.M.; Mediano, P.A.; Shanahan, M. The partial information decomposition of generative neural network models. Entropy; 2017; 19, 474. [DOI: https://dx.doi.org/10.3390/e19090474]
40. Wollstadt, P.; Schmitt, S.; Wibral, M. A rigorous information-theoretic definition of redundancy and relevancy in feature selection based on (partial) information decomposition. arXiv; 2021; arXiv: 2105.04187
41. Mori, T.; Nishikimi, K.; Smith, T.E. A divergence statistic for industrial localization. Rev. Econ. Stat.; 2005; 87, pp. 635-651. [DOI: https://dx.doi.org/10.1162/003465305775098170]
42. Wang, Q.; Kulkarni, S.R.; Verdú, S. Divergence estimation for multidimensional densities via k-Nearest-Neighbor distances. IEEE Trans. Inf. Theory; 2009; 55, pp. 2392-2405. [DOI: https://dx.doi.org/10.1109/TIT.2009.2016060]
43. Nguyen, X.; Wainwright, M.J.; Jordan, M.I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theory; 2010; 56, pp. 5847-5861. [DOI: https://dx.doi.org/10.1109/TIT.2010.2068870]
44. Zhang, Z.; Grabchak, M. Nonparametric estimation of Küllback-Leibler divergence. Neural Comput.; 2014; 26, pp. 2570-2593. [DOI: https://dx.doi.org/10.1162/NECO_a_00646]
45. Press, W.H.; Teukolsky Saul, A. Numerical Recipes in Fortran: The Art of Scientific Computing; Cambridge University Press: Cambridge, UK, 1993.
46. De Mántaras, R.L. A distance-based attribute selection measure for decision tree induction. Mach. Learn.; 1991; 6, pp. 81-92. [DOI: https://dx.doi.org/10.1023/A:1022694001379]
47. Kvalseth, T.O. Entropy and correlation: Some comments. IEEE Trans. Syst. Man Cybern.; 1987; 17, pp. 517-519. [DOI: https://dx.doi.org/10.1109/TSMC.1987.4309069]
48. Strehl, A.; Ghosh, J. Cluster ensembles—A knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res.; 2002; 3, pp. 583-617.
49. Yao, Y. Information-theoretic measures for knowledge discovery and data mining. Entropy Measures, Maximum Entropy Principle and Emerging Applications; Springer: Berlin/Heidelberg, Germany, 2003; pp. 115-136.
50. Vinh, N.X.; Epps, J.; Bailey, J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res.; 2010; 11, pp. 2837-2854.
51. Zhang, Z.; Stewart, A.M. Estimation of Standardized Mutual Information; Technical Report UNC Charlotte Technical Report: Charlotte, NC, USA, 2016.
52. Zhang, Z.; Zhou, J. Re-parameterization of multinomial distributions and diversity indices. J. Stat. Plan. Inference; 2010; 140, pp. 1731-1738. [DOI: https://dx.doi.org/10.1016/j.jspi.2009.12.023]
53. Chen, C. Goodness-of-Fit Tests under Permutations. Ph.D. Thesis; The University of North Carolina at Charlotte: Charlotte, NC, USA, 2019.
54. Simpson, E.H. Measurement of diversity. Nature; 1949; 163, 688. [DOI: https://dx.doi.org/10.1038/163688a0]
55. Gini, C. Measurement of inequality of incomes. Econ. J.; 1921; 31, pp. 124-126. [DOI: https://dx.doi.org/10.2307/2223319]
56. Rényi, A. On measures of entropy and information. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability; Berkeley, CA, USA, 1 January 1961; Volume 1.
57. Emlen, J.M. Ecology: An Evolutionary Approach; Addison-Wesley: Boston, MA, USA, 1977.
58. Zhang, Z.; Chen, C.; Zhang, J. Estimation of population size in entropic perspective. Commun.-Stat.-Theory Methods; 2020; 49, pp. 307-324. [DOI: https://dx.doi.org/10.1080/03610926.2018.1536786]
59. Beck, C.; Schögl, F. Thermodynamics of Chaotic Systems; Cambridge University Press: Cambridge, UK, 1995.
60. Zhang, J.; Shi, J. Asymptotic Normality for Plug-In Estimators of Generalized Shannon’s Entropy. Entropy; 2022; 24, 683. [DOI: https://dx.doi.org/10.3390/e24050683]
61. Zhang, J.; Zhang, Z. A Normal Test for Independence via Generalized Mutual Information. arXiv; 2022; arXiv: 2207.09541
62. Kontoyiannis, I.; Skoularidou, M. Estimating the directed information and testing for causality. IEEE Trans. Inf. Theory; 2016; 62, pp. 6053-6067. [DOI: https://dx.doi.org/10.1109/TIT.2016.2604842]
63. Huang, N.; Lu, G.; Cai, G.; Xu, D.; Xu, J.; Li, F.; Zhang, L. Feature selection of power quality disturbance signals with an entropy-importance-based random forest. Entropy; 2016; 18, 44. [DOI: https://dx.doi.org/10.3390/e18020044]
64. Brown, G.; Pocock, A.; Zhao, M.J.; Luján, M. Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. J. Mach. Learn. Res.; 2012; 13, pp. 27-66.
65. Lewis, D.D. Feature selection and feature extraction for text categorization. Proceedings of the Speech and Natural Language: Proceedings of a Workshop Held at Harriman; New York, NY, USA, 23–26 February 1992.
66. Battiti, R. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Netw.; 1994; 5, pp. 537-550. [DOI: https://dx.doi.org/10.1109/72.298224] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/18267827]
67. Yang, H.; Moody, J. Feature selection based on joint mutual information. Proceedings of the International ICSC Symposium on Advances in Intelligent Data Analysis; Rochester, NY, USA, 22–25 June 1999; Volume 1999, pp. 22-25.
68. Ullman, S.; Vidal-Naquet, M.; Sali, E. Visual features of intermediate complexity and their use in classification. Nat. Neurosci.; 2002; 5, pp. 682-687. [DOI: https://dx.doi.org/10.1038/nn870] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/12055634]
69. Yu, L.; Liu, H. Efficient feature selection via analysis of relevance and redundancy. J. Mach. Learn. Res.; 2004; 5, pp. 1205-1224.
70. Tesmer, M.; Estévez, P.A. AMIFS: Adaptive feature selection by using mutual information. Proceedings of the 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541); Budapest, Hungary, 25–29 July 2004; Volume 1, pp. 303-308.
71. Fleuret, F. Fast binary feature selection with conditional mutual information. J. Mach. Learn. Res.; 2004; 5, pp. 1531-1555.
72. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell.; 2005; 27, pp. 1226-1238. [DOI: https://dx.doi.org/10.1109/TPAMI.2005.159]
73. Jakulin, A. Machine Learning Based on Attribute Interactions. Ph.D. Thesis; Univerza v Ljubljani: Ljubljana, Slovenia, 2005.
74. Lin, D.; Tang, X. Conditional infomax learning: An integrated framework for feature extraction and fusion. Proceedings of the European Conference on Computer Vision; Graz, Austria, 7–13 May 2006; pp. 68-82.
75. Meyer, P.E.; Bontempi, G. On the use of variable complementarity for feature selection in cancer classification. Proceedings of the Workshops on Applications of Evolutionary Computation; Budapest, Hungary, 10–12 April 2006; pp. 91-102.
76. El Akadi, A.; El Ouardighi, A.; Aboutajdine, D. A powerful feature selection approach based on mutual information. Int. J. Comput. Sci. Netw. Secur.; 2008; 8, 116.
77. Guo, B.; Nixon, M.S. Gait feature subset selection by mutual information. IEEE Trans. Syst. Man-Cybern.-Part Syst. Hum.; 2008; 39, pp. 36-46.
78. Cheng, G.; Qin, Z.; Feng, C.; Wang, Y.; Li, F. Conditional Mutual Information-Based Feature Selection Analyzing for Synergy and Redundancy. Etri J.; 2011; 33, pp. 210-218. [DOI: https://dx.doi.org/10.4218/etrij.11.0110.0237]
79. Singhal, A.; Sharma, D. Keyword extraction using Renyi entropy: A statistical and domain independent method. Proceedings of the 2021 7th International Conference on Advanced Computing and Communication Systems (ICACCS); Coimbatore, India, 19–20 March 2021; Volume 1, pp. 1970-1975.
80. R Package Entropy. Available online: https://cran.r-project.org/web/packages/entropy/index.html (accessed on 27 September 2022).
81. R Package Bootstrap. Available online: https://cran.r-project.org/web/packages/bootstrap/index.html (accessed on 27 September 2022).
82. R Package EntropyEstimation. Available online: https://cran.r-project.org/web/packages/EntropyEstimation/index.html (accessed on 27 September 2022).
© 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).