1. Introduction
Cluster analysis is a branch of machine learning in which supervised (teacher-provided) labels are replaced by internal characteristics of the objects or external characteristics of the clusters. The internal characteristics include the distances between objects within a cluster [1,2] and the similarity of objects [3]. Among the external characteristics, we mention the distances between clusters [4]. As a mathematical problem, clustering has no universal formulation. Therefore, clustering algorithms are often heuristic [5,6].
A highly developed area of research is the cluster analysis of large text collections. As a rule, latent features are first detected by latent semantic analysis [7] and then used for clustering [8,9]. Recently, works based on the concept of ensemble clustering have appeared [10,11].
Most clustering algorithms involve the distance between objects, measured in an accepted metric, and enumerative search algorithms with heuristic control [12]. Clustering results significantly depend on the metric. Therefore, it is very important to quantify the quality of clustering [13,14,15].
This paper proposes a clustering method based on a randomized representation of an ensemble of possible clusters with a probability distribution [16]. The concept of a cluster indicator is introduced as the average distance between the objects included in the cluster. Since clusters are treated as random objects, the indicators averaged over the entire ensemble are considered the latter’s characteristics. The optimal distribution of clusters is determined using the randomized machine learning approach: an entropy functional is maximized with respect to the probability distribution subject to constraints imposed on the averaged indicator of the cluster ensemble. The resulting entropy-optimal cluster corresponds in size and composition to the maximum of the optimal probability distribution.
The optimal distribution of clusters is obtained using the randomized maximum entropy estimation (MEE) method described in [16]. The method has proved effective in many machine learning and data mining problems. Among other things, it gives rise to the problem of entropy-randomized clustering. This article presents this problem in more detail, proving the convergence of the multiplicative algorithm and giving the logical schemes of the clustering procedures.
2. An Indicator of Data Matrices
Consider a set of $n$ objects characterized by row vectors $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}$ from the feature space $\mathbb{R}^{m}$. Using these vectors, we construct the following $n$-row matrix:
$$ \mathbf{X} = \begin{bmatrix} \mathbf{x}^{(1)} \\ \vdots \\ \mathbf{x}^{(n)} \end{bmatrix}. \qquad (1) $$
Let the distance between the $i$th and $j$th rows be defined as
$$ d_{ij} = \rho\big(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\big), \qquad i, j = 1, \ldots, n, \qquad (2) $$
where $\rho(\cdot,\cdot)$ denotes an appropriate metric in the feature space $\mathbb{R}^{m}$. Next, we construct the distance matrix
$$ \mathbf{D} = \big[\, d_{ij} \mid i, j = 1, \ldots, n \,\big]. \qquad (3) $$
We introduce an indicator of the matrix $\mathbf{X}$ as the average value of the elements of the distance matrix $\mathbf{D}$:
$$ I(\mathbf{X}) = \frac{1}{n^{2}} \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}. \qquad (4) $$
Below, the objects will be included in clusters depending on the distances in (2). Therefore, the important characteristics of the matrix $\mathbf{X}$ are the minimum and maximum elements of the distance matrix $\mathbf{D}$:
$$ d^{-} = \min_{i \neq j} d_{ij}, \qquad d^{+} = \max_{i, j} d_{ij}. \qquad (5) $$
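As an illustration only (not part of the original paper), the quantities in (2)–(5) can be computed as follows. The Euclidean metric, the averaging over all $n^{2}$ elements in (4), and the exclusion of the diagonal in the minimum of (5) are assumptions made for this sketch.

```python
import numpy as np

def distance_matrix(X):
    """Pairwise Euclidean distances d_ij = rho(x_i, x_j); Equations (2)-(3)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def indicator(D):
    """Matrix indicator (4): the average value of the elements of D."""
    return float(D.mean())

def distance_bounds(D):
    """Bounds (5): the minimum off-diagonal element and the maximum element of D."""
    off_diag = D[~np.eye(D.shape[0], dtype=bool)]
    return float(off_diag.min()), float(D.max())
```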
Note that the elements of the distance matrices of the clusters belong to the interval $[d^{-}, d^{+}]$.
3. Randomized Binary Clustering
The binary clustering problem is to arrange $n$ objects between two clusters $\mathcal{C}_{1}$ and $\mathcal{C}_{2}$ of sizes $s$ and $n-s$, respectively:
$$ \mathcal{C}_{1} \cup \mathcal{C}_{2} = \{1, \ldots, n\}, \qquad \mathcal{C}_{1} \cap \mathcal{C}_{2} = \varnothing. \qquad (6) $$
It is required to find the size and composition of the cluster $\mathcal{C}_{1}$. For each fixed cluster size $s$, the clustering procedure consists of selecting a submatrix $\mathbf{X}^{(1)}$ of $s$ rows from the matrix $\mathbf{X}$. If the matrix $\mathbf{X}^{(1)}$ is selected, then the remaining $n-s$ rows form the matrix $\mathbf{X}^{(2)}$, and the set of their numbers forms the cluster $\mathcal{C}_{2}$.
Clearly, the matrix $\mathbf{X}^{(1)}$ can be formed from the rows of the original matrix $\mathbf{X}$ in $C_{n}^{s}$ different ways (the number of $s$-combinations of a set of $n$ elements). For each of them, the matrix $\mathbf{X}^{(2)}$ is formed in a corresponding number of ways.
According to the principle of randomized binary clustering, the matrix $\mathbf{X}^{(1)}$ is a random object, and its particular images are the realizations of this object. The set of its elements and the number of rows $s$ are therefore random.
A realization of the random object is a set of $s$ row vectors from the original matrix:
$$ \mathbf{X}^{(1)}_{s,k} = \big\{ \mathbf{x}^{(i_{1})}, \ldots, \mathbf{x}^{(i_{s})} \big\}. \qquad (7) $$
We renumber this set as follows:
$$ \big\{ \mathbf{x}^{(i_{1})}, \ldots, \mathbf{x}^{(i_{s})} \big\} \;\Rightarrow\; \big\{ \tilde{\mathbf{x}}^{(1)}, \ldots, \tilde{\mathbf{x}}^{(s)} \big\}. \qquad (8) $$
Thus, the randomization procedure yields a finite ensemble of the form
$$ \mathfrak{X} = \Big\{ \mathbf{X}^{(1)}_{s,k} \;\Big|\; s = 1, \ldots, n-1; \; k = 1, \ldots, C_{n}^{s} \Big\}. \qquad (9) $$
Recall that the matrices in this ensemble are random. Hence, we assume the existence of probabilities for realizing the ensemble elements, where $s$ and $k$ denote the cluster size and cluster realization number, respectively:
$$ p_{s,k} = \Pr\big\{ \mathbf{X}^{(1)}_{s,k} \big\}, \qquad s = 1, \ldots, n-1, \quad k = 1, \ldots, C_{n}^{s}. \qquad (10) $$
Then, the randomized binary clustering problem reduces to determining a discrete probability distribution $P = \{p_{s,k}\}$, which is appropriate in some sense.
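For a fixed size $s$, the ensemble (9) can be enumerated directly. A minimal sketch (explicit enumeration is used here purely for illustration; it is not the authors' implementation):

```python
from itertools import combinations

def cluster_ensemble(n, s):
    """All C(n, s) realizations of the random cluster of size s; see Equation (9)."""
    return list(combinations(range(n), s))

# For n = 20 objects and s = 10 there are C(20, 10) = 184,756 candidate clusters.
assert len(cluster_ensemble(20, 10)) == 184756
```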
Let such a function be obtained; according to the general variational principle of statistical mechanics, the realized matrix will be
$$ \mathbf{X}^{(1)}_{s^{*}, k^{*}}, \qquad (s^{*}, k^{*}) = \arg\max_{(s,k)} p_{s,k}. \qquad (11) $$
This matrix corresponds to the most probable cluster of objects with the numbers
$$ \mathcal{C}_{1}^{*} = \{ i_{1}, \ldots, i_{s^{*}} \}. \qquad (12) $$
The other cluster consists of the remaining objects with the numbers
$$ \mathcal{C}_{2}^{*} = \{1, \ldots, n\} \setminus \mathcal{C}_{1}^{*}. \qquad (13) $$
Generally speaking, there are many such clusters but they all contain the same objects.
4. Entropy-Optimal Distribution
Consider the cluster $\mathcal{C}_{s,k}$ of size $s$, the associated matrix
$$ \mathbf{X}^{(1)}_{s,k} = \begin{bmatrix} \tilde{\mathbf{x}}^{(1)} \\ \vdots \\ \tilde{\mathbf{x}}^{(s)} \end{bmatrix}, \qquad (14) $$
and the distance matrix
$$ \mathbf{D}_{s,k} = \big[\, d^{(s,k)}_{ij} \mid i, j = 1, \ldots, s \,\big]. \qquad (15) $$
We define the matrix indicator in (4) for the cluster as
$$ I_{s,k} = \frac{1}{s^{2}} \sum_{i=1}^{s} \sum_{j=1}^{s} d^{(s,k)}_{ij}. \qquad (16) $$
Since the matrices $\mathbf{X}^{(1)}_{s,k}$ are treated as random objects, their ensemble has a probability distribution $P = \{p_{s,k}\}$. We introduce the average indicator in the form
$$ \bar{I}[P] = \sum_{s,k} p_{s,k}\, I_{s,k}. \qquad (17) $$
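A sketch of the indicator computations (16)–(17), reusing `distance_matrix` and `cluster_ensemble` from the sketches above; the averaging convention is the same assumption as in (4).

```python
import numpy as np

def cluster_indicator(D, members):
    """Indicator (16): the average of the distance sub-matrix over the cluster members."""
    idx = np.array(members)
    return float(D[np.ix_(idx, idx)].mean())

def ensemble_indicators(D, clusters):
    """Indicators I_{s,k} for every realization in the ensemble."""
    return np.array([cluster_indicator(D, c) for c in clusters])

def average_indicator(p, I):
    """Ensemble-averaged indicator (17) under the distribution p."""
    return float(np.dot(p, I))
```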
For determining the discrete probability distribution $P$, we apply randomized machine learning with the Boltzmann–Shannon entropy functional [17]:
$$ H[P] = -\sum_{s,k} p_{s,k} \ln p_{s,k} \;\longrightarrow\; \max \qquad (18) $$
subject to the constraints
$$ 0 \le p_{s,k} \le 1, \qquad s = 1, \ldots, n-1, \quad k = 1, \ldots, C_{n}^{s}, \qquad (19) $$
$$ d^{-} \le \sum_{s,k} p_{s,k}\, I_{s,k} \le d^{+}. \qquad (20) $$
Here, the lower and upper bounds $d^{-}$ and $d^{+}$ for the elements of the distance matrix are given by (5); the indicator $I_{s,k}$ is given by (16).
5. Parametric Problems (18)–(20)
We treat Equations (18)–(20) as finite-dimensional: the objective function (entropy) and the constraints both depend on the finite-dimensional vector composed of the values of the two-dimensional probability distribution $P$:
$$ \mathbf{p} = \big( p_{1,1}, \ldots, p_{1, C_{n}^{1}}, \; \ldots, \; p_{n-1, 1}, \ldots, p_{n-1, C_{n}^{n-1}} \big). \qquad (21) $$
The dimension of this vector is
$$ N = \sum_{s=1}^{n-1} C_{n}^{s}. \qquad (22) $$
The constraints in (19) can be omitted by considering the Fermi entropy [18] as the objective function. Performing standard transformations, we arrive at a finite-dimensional entropy-linear programming problem [19] with the form
$$ H_{F}(\mathbf{p}) = -\sum_{j=1}^{N} \big[ p_{j} \ln p_{j} + (1 - p_{j}) \ln(1 - p_{j}) \big] \;\longrightarrow\; \max, \qquad d^{-} \le \sum_{j=1}^{N} I_{j}\, p_{j} \le d^{+}, \qquad (23) $$
where the index $j$ enumerates the pairs $(s, k)$ and $I_{j}$ denotes the corresponding indicator (16).
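For reference, a small sketch of the Fermi entropy objective; since the source formula for (23) is not fully legible here, the form above (and hence this snippet) follows the reconstruction rather than the authors' exact notation. The constraint value can be evaluated with `average_indicator` from the earlier sketch.

```python
import numpy as np

def fermi_entropy(p, eps=1e-12):
    """Fermi entropy of a vector p with components in (0, 1)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-(p * np.log(p) + (1.0 - p) * np.log(1.0 - p)).sum())
```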
To solve this problem, we employ the Karush–Kuhn–Tucker theorem [20], expressing the optimality conditions in terms of Lagrange multipliers and a Lagrange function. For Equation (23), the Lagrange function has the form
$$ L(\mathbf{p}, \lambda^{-}, \lambda^{+}) = H_{F}(\mathbf{p}) + \lambda^{-} \Big( \sum_{j=1}^{N} I_{j}\, p_{j} - d^{-} \Big) + \lambda^{+} \Big( d^{+} - \sum_{j=1}^{N} I_{j}\, p_{j} \Big). \qquad (24) $$
The optimality conditions for the saddle point of the Lagrange function in (24) are written as
$$ \frac{\partial L}{\partial p_{j}} = 0, \quad j = 1, \ldots, N; \qquad \frac{\partial L}{\partial \lambda^{\mp}} \ge 0, \quad \lambda^{\mp} \ge 0; \qquad (25) $$
$$ \lambda^{\mp}\, \frac{\partial L}{\partial \lambda^{\mp}} = 0. \qquad (26) $$
The first condition in (25) is analytically solvable with respect to the components of the vector $\mathbf{p}$:
$$ p_{j}^{*}(\lambda^{-}, \lambda^{+}) = \frac{\exp\big[ (\lambda^{-} - \lambda^{+})\, I_{j} \big]}{1 + \exp\big[ (\lambda^{-} - \lambda^{+})\, I_{j} \big]}, \qquad j = 1, \ldots, N. \qquad (27) $$
The second condition in (25) yields the inequalities
$$ \Phi_{-}(\lambda) = \sum_{j=1}^{N} I_{j}\, p_{j}^{*}(\lambda) - d^{-} \ge 0, \qquad \Phi_{+}(\lambda) = d^{+} - \sum_{j=1}^{N} I_{j}\, p_{j}^{*}(\lambda) \ge 0, \qquad (28) $$
and the condition in (26) yields the following equations:
$$ \lambda^{-}\, \Phi_{-}(\lambda) = 0, \qquad \lambda^{+}\, \Phi_{+}(\lambda) = 0. \qquad (29) $$
The non-negative solution of these inequalities and equations can be found using a multiplicative algorithm [19] with the form
$$ \lambda^{\mp}[t+1] = \lambda^{\mp}[t]\, \big( 1 - \gamma\, \Phi_{\mp}(\lambda[t]) \big), \qquad t = 0, 1, 2, \ldots \qquad (30) $$
Here, $\gamma > 0$ is a step parameter assigned based on the $\gamma$-convergence conditions of the iterative process in (30). The algorithm in (30) is said to be $\gamma$-convergent if there exist a set $\Lambda^{0}$ in the space of the multipliers and scalars $\gamma^{*} > 0$ and $q \in (0, 1)$ such that, for all $\gamma \in (0, \gamma^{*}]$ and all initial points $\lambda[0] \in \Lambda^{0}$, this algorithm converges to the solution $\lambda^{*}$ of Equation (29), and the rate of convergence in a neighborhood of $\lambda^{*}$ is linear.
The algorithm in (30) is $\gamma$-convergent to the solution of Equation (29).
Consider an auxiliary system of differential equations obtained from (30) as $\gamma \to 0$:
$$ \frac{d \lambda^{\mp}}{dt} = -\lambda^{\mp}\, \Phi_{\mp}(\lambda). \qquad (31) $$
First, we have to establish its stability in the large, i.e., under any initial deviations in the space of the multipliers. Second, we have to demonstrate that the algorithm in (30) is a Euler difference scheme for Equation (31) with an appropriate value of the step $\gamma$.
Let us describe some details of the proof. We define a strictly convex function $V(\lambda)$ on the non-negative orthant of the multiplier space; its minimum is attained at a unique point $\lambda^{*}$ satisfying Equation (29). We then compute the time derivative of $V$ along the trajectories of (31). According to (28), this derivative is non-positive; hence, the function $V$ is a Lyapunov function for Equation (31) in the space of the multipliers. All solutions of Equation (31) are asymptotically stable under any initial conditions $\lambda[0] \ge 0$. Finally, the algorithm in (30) is a Euler difference scheme for (31). Due to the asymptotic stability of the solutions of (31), there always exists a step $\gamma$ and an initial condition domain under which the Euler scheme will converge. □
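A sketch of one possible implementation of the iterations (27)–(30) as reconstructed above; the parameterization of $p^{*}(\lambda)$, the residuals $\Phi_{\mp}$, and the sign conventions are assumptions of this sketch, not the authors' code.

```python
import numpy as np

def fermi_p(lam_minus, lam_plus, I):
    """Analytic solution (27): Fermi-form probabilities parameterized by the multipliers."""
    t = (lam_minus - lam_plus) * I
    return 1.0 / (1.0 + np.exp(-t))

def multiplicative_solve(I, d_lo, d_hi, gamma=1e-3, iters=20000):
    """Multiplicative iterations (30) for the Lagrange multipliers of problem (23)."""
    lam = np.array([1.0, 1.0])                     # initial multipliers (lambda-, lambda+)
    for _ in range(iters):
        p = fermi_p(lam[0], lam[1], I)
        avg = float(np.dot(p, I))                  # averaged indicator at the current multipliers
        phi = np.array([avg - d_lo, d_hi - avg])   # residuals of the interval constraint (20)
        lam = np.maximum(lam * (1.0 - gamma * phi), 0.0)   # Euler step for (31), kept non-negative
    return lam, fermi_p(lam[0], lam[1], I)
```

At a fixed point, each multiplier is either zero or its residual vanishes, which is exactly the complementary-slackness condition (29).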
By the general principle of statistical mechanics, the realized cluster corresponds to the maximum of the probability distribution:
$$ (s^{*}, k^{*}) = \arg\max_{(s,k)} p^{*}_{s,k}. \qquad (32) $$
5.1. Randomized Binary Clustering Algorithms
In this section, the clustering procedures are presented as logical schemes (algorithms); an illustrative code sketch follows each scheme.
5.1.1. Algorithm with a Given Cluster Size s
1. Calculating the numerical characteristics of the data matrix:
   (a) constructing the row vectors of the data matrix (1);
   (b) calculating the elements of the (Euclidean) distance matrix in (3);
   (c) calculating the data matrix indicator in (4);
   (d) calculating the upper and lower bounds (5) for the elements of the matrix in (3).
2. Forming the matrix ensemble:
   (a) forming the correspondence table between the realization numbers k and the s-element subsets of rows;
   (b) constructing the matrices of the ensemble (9);
   (c) calculating the elements of the distance matrices in (15);
   (d) calculating the indicators (16) of these matrices.
3. Determining the Lagrange multipliers for the finite-dimensional problem (23):
   (a) specifying the initial values for the Lagrange multipliers;
   (b) applying the iterative algorithm in (30);
   (c) determining the optimal probability distribution (27);
   (d) determining the most probable cluster via (32);
   (e) determining the complementary cluster via (13).
A code sketch of this scheme is given below.
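Under the same assumptions as the previous sketches, the whole scheme for a given cluster size s might look as follows (the helper names are those introduced above and are hypothetical, not the authors' implementation):

```python
import numpy as np

def binary_clustering_fixed_s(X, s, gamma=1e-3, iters=20000):
    """Entropy-randomized binary clustering for a given cluster size s (Section 5.1.1)."""
    D = distance_matrix(X)                          # step 1: distance matrix (3)
    d_lo, d_hi = distance_bounds(D)                 # bounds (5)
    clusters = cluster_ensemble(X.shape[0], s)      # step 2: ensemble (9)
    I = ensemble_indicators(D, clusters)            # indicators (16)
    _, p = multiplicative_solve(I, d_lo, d_hi, gamma, iters)   # step 3: multipliers and p*
    k_star = int(np.argmax(p))                      # most probable realization (32)
    c1 = set(clusters[k_star])                      # cluster C1
    c2 = set(range(X.shape[0])) - c1                # complementary cluster C2, cf. (13)
    return c1, c2, p
```

Note that the explicit enumeration of all $C_{n}^{s}$ realizations is only feasible for small n; the paper itself points out the high computational cost of forming the ensemble.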
5.1.2. Algorithm with an Unknown Cluster Size
1. Applying step 1 of the algorithm in Section 5.1.1.
2. Organizing a loop with respect to the cluster size s:
   (a) applying step 2 of the algorithm in Section 5.1.1;
   (b) applying step 3 of the algorithm in Section 5.1.1;
   (c) putting the resulting probability distribution into the memory;
   (d) calculating the conditionally maximum value of the entropy;
   (e) putting this value into the memory;
   (f) if the loop over the cluster sizes is not finished, returning to Step 2a;
   (g) determining the maximum element of the stored array of entropy values;
   (h) extracting the corresponding probability distribution;
   (i) executing Steps 3d and 3e of the algorithm in Section 5.1.1.
A code sketch of this scheme is given below.
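A corresponding sketch of the loop over the cluster size, reusing `binary_clustering_fixed_s` and `fermi_entropy` from the sketches above. Taking the conditionally maximum entropy in step 2d to be the Fermi entropy of the optimal distribution is an assumption of this sketch.

```python
def binary_clustering_unknown_s(X, gamma=1e-3, iters=20000):
    """Loop over cluster sizes; keep the size with the maximal conditional entropy (Section 5.1.2)."""
    best = None
    for s in range(1, X.shape[0]):
        c1, c2, p = binary_clustering_fixed_s(X, s, gamma, iters)
        h = fermi_entropy(p)                 # conditionally maximum entropy for this s
        if best is None or h > best[0]:
            best = (h, s, c1, c2)
    return best[1:]                          # (s*, C1, C2)
```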
6. Functional Problems (18)–(20)
Consider a parametric family of all constrained entropy maximization problems with the form
(33)
The solutions of (33) and (23) will coincide for a certain (unknown) value of the parameter. It can be determined by solving Equation (33) for fixed values of the parameter and comparing the resulting values of the entropy functional: the maximum value corresponds to the desired parameter. Let us turn to Equation (33) with a fixed value of the parameter. It belongs to the class of Lyapunov-type problems [21]. We define a Lagrange functional as
(34)
Using the technique of Gâteaux derivatives, we obtain the stationarity conditions for the functional in (34) in the primal (functional) and dual (scalar) variables; for details, see [22,23]. The resulting optimal distribution, parameterized by the Lagrange multiplier, is given by
(35)
The Lagrange multiplier satisfies the equation
(36)
The solution of this equation exists and depends on the parameter of the family (33). Hence, the maximum value of the entropy functional also depends on this parameter. We choose the parameter value delivering the maximum entropy:
(37)
The randomized binary clustering procedure can be repeated $t-1$ times to form $t$ clusters. At each stage, two new clusters are generated from the remaining objects of the previous stage.
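A minimal sketch of this repeated binary procedure under the previous assumptions; whether each stage fixes the cluster size or re-runs the unknown-size algorithm is not specified here, so the sketch assumes the latter.

```python
def t_ary_clustering(X, t):
    """Repeated binary clustering: split off one entropy-optimal cluster per stage."""
    remaining = list(range(X.shape[0]))
    clusters = []
    while len(clusters) < t - 1 and len(remaining) > 1:
        _, c1, c2 = binary_clustering_unknown_s(X[remaining])
        clusters.append([remaining[i] for i in sorted(c1)])   # objects split off at this stage
        remaining = [remaining[i] for i in sorted(c2)]        # objects passed to the next stage
    clusters.append(remaining)
    return clusters
```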
7. Illustrative Examples
Consider the binary clustering of iris flowers using Fisher's Iris dataset (in this dataset, each iris flower is described by features such as the petal length and the petal width). The database contains this feature information for three types of flowers, "setosa" (1), "versicolor" (2), and "virginica" (3), with 50 two-dimensional data points per species. Below, we study types 1 and 2 and 10 data points for each type.
The data matrix contains the numerical values of the two features for types 1 and 2; see Table 1.
Figure 1 shows the arrangement of the data points on the plane.
First, we apply the algorithm with a given cluster size; see Section 5.1.1.
The minimum and maximum elements are
(38)
The data matrix indicator is calculated by (4). Let $s = 10$. The ensemble of possible clusters has the size $C_{20}^{10} = 184{,}756$. Consider one cluster from this ensemble; its distance matrix is presented in Table 2.
The indicator of the matrix corresponding to the cluster is
(39)
The indicators for the clusters are shown in Figure 2.
The entropy-optimal probability distribution for $s = 10$ has the form
(40)
The cluster with the maximum probability is numbered by
(41)
The composition of this cluster, together with the complementary cluster, is shown in Figure 3.
A direct comparison with Figure 1 indicates a perfect match of 10/10: no clustering errors.
Consider another data matrix from the same dataset (Table 3).
Figure 4 shows the arrangement of the data points.
Similar to Example 1, we apply the algorithm of Section 5.1.1.
We construct the distance matrix and find the minimum and maximum elements:
(42)
Let $s = 10$. The ensemble of possible clusters has the size $C_{20}^{10} = 184{,}756$. The indicators for the clusters are shown in Figure 5.
The entropy-optimal probability distribution for $s = 10$ has the form
(43)
The cluster with the maximum probability is numbered by
(44)
The compositions of the two resulting clusters are shown in Figure 6. A direct comparison with Figure 4 indicates a match of 8/10.
8. Discussion and Conclusions
This paper has developed a novel concept of clustering. Its fundamental difference from the conventional approaches is the generation of an ensemble of random clusters, accompanied by the matrices of inter-object distances averaged over the entire ensemble (the so-called indicators). Random clusters are parameterized by the number of objects s and their set . Therefore, the ensemble’s characteristic is the probability distribution of the clusters in the ensemble, which depends on s and k. A generalized variational principle of statistical mechanics has been proposed to find this distribution. It consists of the conditional maximization of the Boltzmann–Shannon entropy. Algorithms for solving the finite-dimensional and functional optimization problems have been developed.
An advantage of the proposed randomized clustering method is its complete algorithmization, independent of the properties of the clustered data. Existing clustering methods involve, to a greater or lesser extent, empirical techniques tied to data properties.
However, this method requires high computational resources to form an ensemble of random clusters and their indicators.
Conceptualization, Y.S.P.; Data curation, A.Y.P.; Methodology, Y.S.P., A.Y.P., and Y.A.D.; Software, A.Y.P. and Y.A.D.; Supervision, Y.S.P.; Writing—original draft, Y.S.P., A.Y.P. and Y.A.D. All authors have read and agreed to the published version of the manuscript.
The authors declare no conflict of interest.
Data matrix (Example 1).
No. | Petal length (cm) | Petal width (cm) | Type |
---|---|---|---|
1 | 4.5 | 1.5 | 2 |
2 | 4.6 | 1.5 | 2 |
3 | 4.7 | 1.4 | 2 |
4 | 1.7 | 0.4 | 1 |
5 | 1.3 | 0.2 | 1 |
6 | 1.4 | 0.3 | 1 |
7 | 1.5 | 0.2 | 1 |
8 | 3.9 | 1.4 | 2 |
9 | 4.5 | 1.3 | 2 |
10 | 4.6 | 1.3 | 2 |
11 | 1.4 | 0.2 | 1 |
12 | 4.7 | 1.6 | 2 |
13 | 4.0 | 1.3 | 2 |
14 | 1.4 | 0.2 | 1 |
15 | 1.4 | 0.2 | 1 |
16 | 1.5 | 0.2 | 1 |
17 | 1.5 | 0.1 | 1 |
18 | 4.9 | 1.5 | 2 |
19 | 3.3 | 1.0 | 2 |
20 | 1.4 | 0.2 | 1 |
Distance matrix for the cluster considered in Example 1.
No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 0.1 | 0.22 | 3.01 | 3.45 | 3.32 | 3.27 | 3.36 | 3.36 | 3.36 |
2 | 0.1 | 0 | 0.14 | 3.1 | 3.55 | 3.42 | 3.36 | 3.45 | 3.45 | 3.45 |
3 | 0.22 | 0.14 | 0 | 3.16 | 3.61 | 3.48 | 3.42 | 3.51 | 3.51 | 3.51 |
4 | 3.01 | 3.1 | 3.16 | 0 | 0.45 | 0.32 | 0.28 | 0.36 | 0.36 | 0.36 |
5 | 3.45 | 3.55 | 3.61 | 0.45 | 0 | 0.14 | 0.2 | 0.1 | 0.1 | 0.1 |
6 | 3.32 | 3.42 | 3.48 | 0.32 | 0.14 | 0 | 0.14 | 0.1 | 0.1 | 0.1 |
7 | 3.27 | 3.36 | 3.42 | 0.28 | 0.2 | 0.14 | 0 | 0.1 | 0.1 | 0.1 |
8 | 3.36 | 3.45 | 3.51 | 0.36 | 0.1 | 0.1 | 0.1 | 0 | 0 | 0 |
9 | 3.36 | 3.45 | 3.51 | 0.36 | 0.1 | 0.1 | 0.1 | 0 | 0 | 0 |
10 | 3.36 | 3.45 | 3.51 | 0.36 | 0.1 | 0.1 | 0.1 | 0 | 0 | 0 |
Data matrix (Example 2).
No. | Sepal length (cm) | Sepal width (cm) | Type |
---|---|---|---|
1 | 6.4 | 3.2 | 2 |
2 | 6.5 | 2.8 | 2 |
3 | 7.0 | 3.2 | 2 |
4 | 5.4 | 3.9 | 1 |
5 | 4.7 | 3.2 | 1 |
6 | 4.6 | 3.4 | 1 |
7 | 4.6 | 3.1 | 1 |
8 | 5.2 | 2.7 | 2 |
9 | 5.7 | 2.8 | 2 |
10 | 6.6 | 2.9 | 2 |
11 | 5.1 | 3.5 | 1 |
12 | 6.3 | 3.3 | 2 |
13 | 5.5 | 2.3 | 2 |
14 | 4.4 | 2.9 | 1 |
15 | 4.9 | 3.0 | 1 |
16 | 5.0 | 3.4 | 1 |
17 | 4.9 | 3.1 | 1 |
18 | 6.9 | 3.1 | 2 |
19 | 4.9 | 2.4 | 2 |
20 | 5.0 | 3.6 | 1 |
References
1. Mandel, I.D. Klasternyi Analiz (Cluster Analysis); Finansy i Statistika: Moscow, Russia, 1988.
2. Zagoruiko, N.G. Kognitivnyi Analiz Dannykh (Cognitive Data Analysis); GEO: Novosibirsk, Russia, 2012.
3. Zagoruiko, N.G.; Barakhnin, V.B.; Borisova, I.A.; Tkachev, D.A. Clusterization of Text Documents from the Database of Publications Using FRiS-Tax Algorithm. Comput. Technol.; 2013; 18, pp. 62-74.
4. Jain, A.; Murty, M.; Flynn, P. Data Clustering: A Review. ACM Comput. Surv.; 1999; 31, pp. 264-323. [DOI: https://dx.doi.org/10.1145/331499.331504]
5. Vorontsov, K.V. Lektsii po Algoritmam Klasterizatsii i Mnogomernomu Shkalirovaniyu (Lectures on Clustering Algorithms and Multidimensional Scaling); Moscow State University: Moscow, Russia, 2007.
6. Lescovec, J.; Rajaraman, A.; Ullman, J. Mining of Massive Datasets; Cambridge University Press: Cambridge, UK, 2014.
7. Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R. Indexing by Latent Semantic Analysis. J. Am. Soc. Inf. Sci.; 1990; 41, pp. 391-407. [DOI: https://dx.doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9]
8. Zamir, O.E. Clustering Web Documents: A Phrase-Based Method for Grouping Search Engine Results. Ph.D. Thesis; The University of Washington: Seattle, WA, USA, 1999.
9. Cao, G.; Song, D.; Bruza, P. Suffix-Tree Clustering on Post-retrieval Documents Information; The University of Queensland: Brisbane, QLD, Australia, 2003.
10. Huang, D.; Wang, C.D.; Lai, J.H.; Kwoh, C.K. Toward multidiversified ensemble clustering of high-dimensional data: From subspaces to metrics and beyond. IEEE Trans. Cybern.; 2021; [DOI: https://dx.doi.org/10.1109/TCYB.2021.3049633] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33961570]
11. Khan, I.; Luo, Z.; Shaikh, A.K.; Hedjam, R. Ensemble clustering using extended fuzzy k-means for cancer data analysis. Expert Syst. Appl.; 2021; 172, 114622. [DOI: https://dx.doi.org/10.1016/j.eswa.2021.114622]
12. Jain, A.; Dubes, R. Clustering Methods and Algorithms; Prentice-Hall: Hoboken, NJ, USA, 1988.
13. Pal, N.R.; Biswas, J. Cluster Validation Using Graph Theoretic Concept. Pattern Recognit.; 1997; 30, pp. 847-857. [DOI: https://dx.doi.org/10.1016/S0031-3203(96)00127-6]
14. Halkidi, M.; Batistakis, Y.; Vazirgiannis, M. On Clustering Validation Techniques. J. Intell. Inf. Syst.; 2001; 17, pp. 107-145. [DOI: https://dx.doi.org/10.1023/A:1012801612483]
15. Han, J.; Kamber, M.; Pei, J. Data Mining Concept and Techniques; Morgan Kaufmann Publishers: Burlington, MA, USA, 2012.
16. Popkov, Y.S. Randomization and Entropy in Machine Learning and Data Processing. Dokl. Math.; 2022; 105, pp. 135-157. [DOI: https://dx.doi.org/10.1134/S1064562422030073]
17. Popkov, Y.S.; Dubnov, Y.A.; Popkov, A.Y. Introduction to the Theory of Randomized Machine Learning. Learning Systems: From Theory to Practice; Sgurev, V.; Piuri, V.; Jotsov, V. Springer International Publishing: Cham, Switzerland, 2018; pp. 199-220. [DOI: https://dx.doi.org/10.1007/978-3-319-75181-8_10]
18. Popkov, Y.S. Macrosystems Theory and Its Applications (Lecture Notes in Control and Information Sciences Vol 203); Springer: Berlin, Germany, 1995.
19. Popkov, Y.S. Multiplicative Methods for Entropy Programming Problems and their Applications. Proceedings of the 2010 IEEE International Conference on Industrial Engineering and Engineering Management; Xiamen, China, 29–31 October 2010; pp. 1358-1362. [DOI: https://dx.doi.org/10.1109/IEEM.2010.5674404]
20. Polyak, B.T. Introduction to Optimization; Optimization Software: New York, NY, USA, 1987.
21. Ioffe, A.D.; Tihomirov, V.M. Teoriya Ekstremalnykh Zadach (Theory of Extremal Problems); Nauka: Moscow, Russia, 1974.
22. Tihomirov, V.M.; Alekseev, V.N.; Fomin, S.V. Optimal Control; Nauka: Moscow, Russia, 1979.
23. Popkov, Y.; Popkov, A. New methods of entropy-robust estimation for randomized models under limited data. Entropy; 2014; 16, pp. 675-698. [DOI: https://dx.doi.org/10.3390/e16020675]
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
This paper proposes a clustering method based on a randomized representation of an ensemble of possible clusters with a probability distribution. The concept of a cluster indicator is introduced as the average distance between the objects included in the cluster. The indicators averaged over the entire ensemble are considered the latter’s characteristics. The optimal distribution of clusters is determined using the randomized machine learning approach: an entropy functional is maximized with respect to the probability distribution subject to constraints imposed on the averaged indicator of the cluster ensemble. The resulting entropy-optimal cluster corresponds to the maximum of the optimal probability distribution. This method is developed for binary clustering as a basic procedure. Its extension to t-ary clustering is considered. Some illustrative examples of entropy-randomized clustering are given.