Full text

Turn on search term navigation

1. Introduction

Cluster analysis of different objects is a branch of machine learning where the teacher’s labels are replaced by some internal characteristics of objects or external characteristics of clusters. The internal ones include the distances between objects within the cluster [1,2] and the similarity of objects [3]. Among the external characteristics, we mention the distances between the clusters [4]. As a mathematical problem, clustering has no universal statement. Therefore, clustering algorithms are often heuristic [5,6].

A highly developed area of research is cluster analysis of large text arrays. As a rule, latent features are first detected based on latent semantic analysis [7]. Subsequently, they are used for clustering [8,9]. Recently, there have appeared works based on the concept of ensemble clustering [10,11].

Most clustering algorithms involve the distance between objects, measured in an accepted metric, and enumerative search algorithms with heuristic control [12]. Clustering results significantly depend on the metric. Therefore, it is very important to quantify the quality of clustering [13,14,15].

This paper proposes a clustering method based on a randomized representation of an ensemble of possible clusters with a probability distribution [16]. The concept of a cluster indicator is introduced as the average distance between the objects included in the cluster. Since clusters are treated as random objects, the indicators averaged over the entire ensemble are considered the latter’s characteristics. The optimal distribution of clusters is determined using the randomized machine learning approach: an entropy functional is maximized with respect to the probability distribution subject to constraints imposed on the averaged indicator of the cluster ensemble. The resulting entropy-optimal cluster corresponds in size and composition to the maximum of the optimal probability distribution.

The optimal distribution of clusters is based on the method of the randomized maximum entropy estimation (MEE method) described in [16]. The method turns out to be effective in many machine learning and data mining problems. Among other features, it introduces the problem of entropy-randomized clustering. This article is devoted to a more detailed presentation of this problem in terms of proving the convergence of the multiplicative algorithm and the logical scheme of the clustering procedures.

2. An Indicator of Data Matrices

Consider a set of n objects characterized by row vectors $x^{(1)}, \dots, x^{(n)}$ from the feature space $R^{m}$ . Using these vectors, we construct the following n-row matrix:

(1) $X^{(1, \dots, n)} = (\begin{matrix} x^{(1)} \\ \dots \\ x^{(n)} \end{matrix}) .$

Let the distance between the ith and jth rows be defined as

(2) $ϱ (x^{(i)}, x^{(j)}) = {∥ x^{(i)} - x^{(j)} ∥}_{R^{m}},$

where

{∥ • ∥}_{R^{m}}

denotes an appropriate metric in the feature space

R^{m}

. Next, we construct the distance matrix

(3) $D_{(n \times n)} = (\begin{matrix} 0 & ϱ (x^{(1)}, x^{(2)}) & \dots & ϱ (x^{(1)}, x^{(n)}) \\ ϱ (x^{(2)}, x^{(1)}) & 0 & \dots & ϱ (x^{(2)}, x^{(n)}) \\ \dots & \dots & \dots \\ ϱ (x^{(n)}, x^{(1)}) & ϱ (x^{(n)}, x^{(2)}) & \dots & 0 \end{matrix}) .$

We introduce an indicator of the matrix

X^{(1, \dots, n)}

as the average value of the elements of the distance matrix

D

(4) $d i s (X) = \frac{2}{n (n - 1)} \sum_{(i, j) = 1, j \neq i}^{n} ϱ (x^{(i)}, x^{(j)}) .$

Below, the objects will be included in clusters depending on the distances in (2). Therefore, the important characteristics of the matrix

X^{(1, \dots, n)}

are the minimum and maximum elements of the distance matrix

D

(5) $i n f (D) = min_{i, j} ϱ (x^{(i)}, x^{(j)}), s u p (D) = max_{i, j} ϱ (x^{(i)}, x^{(j)}) .$

Note that the elements of the distance matrices of the clusters belong to the interval

I = [i n f (D), s u p (D)] .

3. Randomized Binary Clustering

The binary clustering problem is to arrange n objects between two clusters $K_{(s^{*})}$ and $K_{(n - s^{*})}$ of sizes $s^{*}$ and $(n - s^{*}),$ respectively:

(6) $\begin{matrix} K_{(s^{*})} = {i_{1}, \dots, i_{s^{*}}}, K_{(n - s^{*})} = {j_{1}, \dots, j_{(n - s^{*})}}; \\ (i_{1}, \dots, i_{s}) \neq (j_{1}, \dots, j_{(n - s^{*})}), (i_{α}, j_{β}) = \bar{1, n}, α = \bar{1, s^{*}}, β = \bar{1, n - s^{*}} . \end{matrix}$

It is required to find the size

s^{*}

and composition

{i_{1}, \dots, i_{s^{*}}}

of the cluster

K_{(s^{*})} .

For each fixed cluster size $s,$ the clustering procedure consists of selecting a submatrix $X_{(s)}$ of some $s < n$ rows from the matrix $X^{(1, \dots, n)} .$ If the matrix $X_{(s)}$ is selected, then the remaining rows form the matrix $X_{(n - s)}$ and the set of their numbers form the cluster $K_{(n - s)} .$

Clearly, the matrix $X_{(s)}$ can be formed from the rows of the original matrix $X^{(1, \dots, n)}$ in $C_{s}^{n}$ different ways (the number of s-combinations from the set of n elements). For each of them, the matrices $X_{(n - s)}$ can be formed in a corresponding number of ways.

According to the principle of randomized binary clustering, the matrix $X_{(s)}$ is a random object and its particular images are the realizations of this object. The sets of its elements and the number of rows s are therefore random.

A realization of the random object is a set of row vectors from the original matrix:

(7) $X_{(s)} = X_{(s)}^{(i_{1}, \dots, i_{s})} = (\begin{matrix} x^{(i_{1})} \\ \dots \\ x^{(i_{s})} \end{matrix}) .$

We renumber this set as follows:

(8) ${i_{1}, \dots, i_{s}} \to k = \bar{1, K (s)}; K (s) = C_{s}^{n} .$

Thus, the randomization procedure yields a finite ensemble of the form

(9) $X_{(s)} = \{X_{(s)}^{1}, \dots, X_{(s)}^{K (s)}\} .$

Recall that the matrices in this ensemble are random. Hence, we assume the existence of probabilities

p (s, k)

for realizing the ensemble elements, where s and k denote the cluster size and cluster realization number, respectively:

(10) $X_{(s)}^{(k)} w i t h p r o b a b i l i t y p (s, k), s = \bar{1, (n - 1)}, k = \bar{1, K} .$

Then, the randomized binary clustering problem reduces to determining a discrete probability distribution $p (s, k), (s = \bar{1, (n - 1)}, k = \bar{1, K})$ , which is appropriate in some sense.

Let such a function $p^{*} (s, k)$ be obtained; according to the general variational principle of statistical mechanics, the realized matrix will be

(11) $X_{(s^{*})}^{(k^{*} (s^{*}))}, w h e r e k^{*} (s^{*}) = max_{s, k} p^{*} (s, k) .$

This matrix corresponds to the most probable cluster of

s^{*}

objects with the numbers

(12) $K_{1}^{*} = {i_{1}^{*}, \dots, i_{s^{*}}^{*}} \to k^{*} (s^{*}) .$

The other cluster consists of the remaining

(n - s^{*})

objects with the numbers

(13) $K_{2}^{*} = {j_{1}^{*}, \dots, j_{(n - s^{*})}^{*}}, (j_{1}^{*}, \dots, j_{(n - s^{*})}^{*}) \neq (i_{1}^{*}, \dots, i_{s^{*}}^{*}) .$

Generally speaking, there are many such clusters but they all contain the same $(n - s^{*})$ objects.

4. Entropy-Optimal Distribution $p^{*} (s, k)$

Consider the cluster $K_{1}$ of size $s,$ the associated matrix

(14) $X_{(s)}^{(i_{1}, \dots, i_{s})} = (\begin{matrix} x^{(i_{1})} \\ \dots \\ x^{(i_{s})} \end{matrix}) = X_{(s)}^{(k)}, {i_{1}, \dots, i_{s}} \to k,$

and the distance matrix

(15) $D_{(s)}^{(i_{1}, \dots, i_{s})} = (\begin{matrix} 0 & ϱ^{(k)} (x^{i_{1}}, x^{i_{2}}) & \dots & ϱ^{(k)} (x^{i_{1}}, x^{i_{s}}) \\ \dots & \dots & \dots \\ ϱ^{(k)} (x^{i_{s}}, x^{i_{1}}) & ϱ^{(k)} (x^{i_{s}}, x^{i_{2}}) & \dots & 0 \end{matrix}) = D_{(s)}^{(k)} .$

We define the matrix indicator in (4) for the cluster

K_{s}

(16) $d i s (X_{(s)}^{(k)}) = \frac{2}{s (s - 1)} \sum_{(t, h) = 1, t \neq h}^{s} ϱ^{(k)} (x^{i_{t}}, x^{i_{h}}) .$

Since the matrices

X_{(s)}^{(k)}

are supposed random objects, their ensemble has a probability distribution

p (s, k)

. We introduce the average indicator in the form

(17) $M {d i s (X_{(s)}^{(k)})} = \sum_{s = 1}^{n - 1} \sum_{k = 1}^{K (s)} p (s, k) d i s (X_{(s)}^{(k)}) .$

For determining the discrete probability distribution $p (s, k),$ we apply randomized machine learning with the Boltzmann–Shannon entropy functional [17]:

(18) $H_{B} [p (s, k)] = - \sum_{s = 1}^{n - 1} \sum_{k = 1}^{K (s)} p (s, k) ln p (s, k) \Rightarrow max$

subject to the constraints

(19) $0 \leq p (s, k) \leq 1, s = \bar{1, (n - 1)}, k = \bar{1, K (s)},$

(20) $i n f (D) \leq \sum_{s = 1}^{n - 1} \sum_{k = 1}^{K (s)} p (s, k) d i s (X_{(s)}^{(k)}) \leq s u p (D) .$

Here, the lower

inf (D)

and upper

sup (D)

bounds for the elements of the distance matrix are given by (5); the indicator

d i s (X_{(s)}^{(k)})

is given by (16).

5. Parametrical Problems (18)–(20)

We treat Equations (18)–(20) as finite-dimensional: the objective function (entropy) and the constraints both depend on the finite-dimensional vector $p$ composed of the values of the two-dimensional probability distribution $p (s, k)$ :

(21) $p = {p (1, k \in \bar{1, K (1)}), \dots, p ((n - 1, k \in \bar{K (n - 2) + 1, K (n - 1)}))} .$

The dimension of this vector is

(22) $M = \sum_{s = 1}^{(n - 1)} K (s) .$

The constraints in (19) can be omitted by considering the Fermi entropy [18] as the objective function. Performing standard transformations, we arrive at a finite-dimensional entropy-linear programming problem [19] with the form

(23) $\begin{matrix} H (p) = - \sum_{s = 1}^{n - 1} \sum_{k = 1}^{K (s)} p (s, k) ln p (s, k) + (1 - p (s, k)) ln (1 - p (s, k)) \Rightarrow max_{0 \leq p (s, k) \leq 1}, \\ \sum_{s = 1}^{n - 1} \sum_{k = 1}^{K (s)} p (s, k) \bar{d i s} (X_{(s)}^{(k)}) \leq - 1, \bar{d i s} (X_{(s)}^{(k)}) = - \frac{d i s (X_{(s)}^{(k)})}{inf (D)}, \\ \sum_{s = 1}^{n - 1} \sum_{k = 1}^{K (s)} p (s, k) \underset{̲}{d i s} (X_{(s)}^{(k)}) \leq 1, \underset{̲}{d i s} (X_{(s)}^{(k)}) = \frac{d i s (X_{(s)}^{(k)})}{sup (D)} . \end{matrix}$

To solve this problem, we employ the Karush–Kuhn–Tucker theorem [20], expressing the optimality conditions in terms of Lagrange multipliers and a Lagrange function. For Equation (23), the Lagrange function has the form

(24) $\begin{matrix} L [p, λ_{1}, λ_{2}] & = & H (p) + λ_{1} (- 1 - \sum_{s = 1}^{n - 1} \sum_{k = 1}^{K (s)} p (s, k) \bar{d i s} (X_{(s)}^{(k)})) + \\ + & λ_{2} (1 - \sum_{s = 1}^{n - 1} \sum_{k = 1}^{K (s)} p (s, k) \underset{̲}{d i s} (X_{(s)}^{(k)})) . \end{matrix}$

The optimality conditions for the saddle point of the Lagrange function in (24) are written as

(25) $\nabla_{p} L (p^{*}, λ_{1}^{*}, λ_{2}^{*}) = 0, \frac{\partial L (p^{*}, λ_{1}^{*}, λ_{2}^{*})}{\partial λ_{i}} \geq 0,$

(26) $λ_{i} \frac{\partial L (p^{*}, λ_{1}^{*}, λ_{2}^{*})}{\partial λ_{i}} = 0, λ_{i} \geq 0, i = 1, 2 .$

The first condition in (25) is analytically solvable with respect to the components of the vector

p

(27) $\begin{matrix} p^{*} (s, k | λ_{1}, λ_{2}) & = & \frac{exp (- λ_{1} \bar{d i s} (X_{(s)}^{(k)}) - λ_{2} \underset{̲}{d i s} (X_{(s)}^{(k)}))}{1 + exp (- λ_{1} \bar{d i s} (X_{(s)}^{(k)}) - λ_{2} \underset{̲}{d i s} (X_{(s)}^{(k)}))}, \\ s & = & \bar{1, (n - 1)}, k = \bar{1, K (s)} . \end{matrix}$

The second condition in (25) yields the inequalities

(28) $\begin{matrix} L_{λ_{1}} (p^{*} (s, k | λ_{1}^{*}, λ_{2}^{*}) = - 1 - \sum_{s = 1}^{n - 1} \sum_{k = 1}^{K (s)} p^{*} (s, k | λ_{1}^{*}, λ_{2}^{*}) \bar{d i s} (X_{(s)}^{(k)}) \geq 0, \\ L_{λ_{2}} (p^{*} (s, k | λ_{1}^{*}, λ_{2}^{*}) = 1 - \sum_{s = 1}^{n - 1} \sum_{k = 1}^{K (s)} p^{*} (s, k | λ_{1}^{*}, λ_{2}^{*}) \underset{̲}{d i s} (X_{(s)}^{(k)}) \geq 0, \end{matrix}$

and the condition in (26) yields the following equations:

(29) $\begin{matrix} λ_{1}^{*} L_{λ_{1}} (p^{*} (s, k | λ_{1}^{*}, λ_{2}^{*}) = 0, \\ λ_{2}^{*} L_{λ_{2}} (p^{*} (s, k | λ_{1}^{*}, λ_{2}^{*}) = 0, \\ λ_{1}^{*} \geq 0, λ_{2}^{*} \geq 0 . \end{matrix}$

The non-negative solution of these inequalities and equations can be found using a multiplicative algorithm [19] with the form

(30) $\begin{matrix} λ_{1}^{q + 1} = λ_{1}^{q} (1 + γ L_{λ_{1}} (p^{*} (s, k | λ_{1}^{q}, λ_{2}^{q})), \\ λ_{2}^{q + 1} = λ_{2}^{q} (1 + γ L_{λ_{2}} (p^{*} (s, k | λ_{1}^{q}, λ_{2}^{q})), (λ_{1}^{0}, λ_{2}^{0}) > 0 . \end{matrix}$

Here,

γ > 0

is a parameter assigned based on the

G

-convergence conditions of the iterative process in (30).

The algorithm in (30) is said to be $G$ -convergent if there exists a set $G$ in the space $R_{+}^{2}$ and scalars $a (G)$ and $γ$ such that, for all $(λ_{1}^{0}, λ_{2}^{0}) \in G$ and $0 < γ \leq a (G),$ this algorithm converges to the solution $(λ_{1}^{*}, λ_{2}^{*})$ of Equation (30), and the rate of convergence in the neighborhood of $(λ_{1}^{*}, λ_{2}^{*})$ is linear.

Theorem 1.

The algorithm in (30) is $G$ -convergent to the solution of Equation (29).

Proof.

Consider an auxiliary system of differential equations obtained from (30) as $γ \to 0$ :

(31) $\frac{d λ_{i}}{d t} = λ_{i} L_{λ_{i}} (p^{*} (s, k | λ_{1}, λ_{2})), i = 1, 2 .$

First, we have to establish its stability in the large, i.e., under any initial deviations in the space

R_{+}^{2} .

Second, we have to demonstrate that the algorithm in (30) is a Euler difference scheme for Equation (31) with an appropriate value $γ .$

Let us describe some details of the proof. We define the following function in $R_{+}^{2}$ :

$V (λ_{1}, λ_{2}) = - \sum_{i = 1}^{2} λ_{i}^{*} (ln λ_{i} - ln λ_{i}^{*}) .$

The function is strictly convex in

R_{+}^{2}

. Its Hessian is

$Γ = d i a g [\frac{λ_{i}^{*}}{λ_{i}^{2}} | i = 1, 2] \geq 0 .$

Hence,

{min}_{R_{+}^{2}} V (λ_{1}, λ_{2}) = 0

and it is achieved at the point

(λ_{1}^{*}, λ_{2}^{*})

. Thus,

V (λ_{1}, λ_{2}) > 0, (λ_{1}, λ_{2}) \in R_{+}^{2}

, and

V (λ_{1}^{*}, λ_{2}^{*}) = 0

We define the time derivative along the trajectories of (31):

$\frac{d V}{d t} = - λ_{1}^{*} L_{λ_{1}} (p^{*} (s, k) | λ_{1}, λ_{2}) - λ_{2}^{*} L_{λ_{2}} (p^{*} (s, k) | λ_{1}, λ_{2}) .$

According to (28) on

R_{+}^{2}

$\frac{d V}{d t} = \{\begin{matrix} < 0 if λ_{1}^{*} > 0, λ_{2}^{*} > 0 \\ = 0 if λ_{1}^{*} = λ_{2}^{*} = 0 . \end{matrix}$

Hence, the function V is a Lyapunov function for Equation (31) in the space

R_{+}^{2}

. All solutions of Equation (31) are asymptotically stable under any initial conditions

λ_{1}^{0} > 0

and

λ_{2}^{0} > 0 .

The algorithm in (30) is a Euler difference scheme. Due to the asymptotic stability of the solutions of (31), there always exists a step $γ > 0$ and an initial condition domain under which the Euler scheme will converge. □

By the general principle of statistical mechanics, the realized cluster corresponds to the maximum of the probability distribution:

(32) $K_{s^{*}, k^{*}} \Rightarrow (s^{*}, k^{*}) = max_{s, k} p^{*} (s, k) .$

5.1. Randomized Binary Clustering Algorithms

In this section, the algorithms of the clustering procedures in terms of logical schemes are proposed.

5.1.1. Algorithm $R 2 K (s)$ with a Given Cluster Size s

1.. Calculating the numerical characteristics of the data matrix $X^{(1, \dots, n)}$
- (a). Constructing the row vectors
  $x^{(1)} = {x_{11}, \dots, x_{1 n}}, \dots, x^{(n)} = {x_{n 1}, \dots, x_{n n}} .$
- (b). Calculating the elements of the (Euclidean) distance matrix $D_{(n \times n)}$ in (3):
  $ϱ (x^{(i)}, x^{(j)}) = ϱ_{i j} = \sqrt{\sum_{(k, l)}^{m} {(x_{k}^{(i)} - x_{l}^{(j)})}^{2}}, (i, j) = \bar{1, n} .$
- (c). Calculating the data matrix indicator in (4):
  $d i s (X) = \frac{2}{n (n - 1)} \sum_{(i j) = 1}^{n} ϱ_{i, j} .$
- (d). Calculating the upper and lower bounds for the elements of the matrix $D_{(n \times n)}$ in (3):
  $inf (D) = min_{i j} ϱ_{i j}, sup (D) = max_{i j} ϱ_{i j} .$
2.. Forming the matrix ensemble $X_{(s)}^{(i_{1}, \dots, i_{s})}$
- (a). Forming the correspondence table
  $i_{1}, \dots, i_{s} \to k, i_{j} = \bar{1, n}, j = \bar{1, s}; k = \bar{1, K (s)}, K (s) = C_{s}^{n} .$
- (b). Constructing the matrices $X^{(i_{1}, \dots, i_{s})} .$
- (c). Calculating the elements of the distance matrices $D_{(s)}^{(k)}$ in (15):
  $ϱ^{(k)} (x^{i_{h}}, x^{i_{q}}) = \sqrt{\sum_{(i_{k}, i_{q}) = 1}^{m} {(x^{i_{h}} - x^{i_{q}})}^{2}}, i_{k} \neq i_{q} .$
- (d). Calculating the indicator of the matrix $X_{(s)}^{(k)}$ :
  $d i s (X_{(s)}^{(k)}) = \frac{2}{s (s - 1)} \sum_{i_{h}, i_{q}}^{s} ϱ_{i_{h}, i_{q}} .$
3.. Determining the Lagrange multipliers $λ_{1}$ and $λ_{2}$ for the finite-dimensional problem
- (a). Specifying the initial values for the Lagrange multipliers:
  $λ_{1}^{(0)} > 0, λ_{2}^{(0)} > 0 .$
- (b). Applying the iterative algorithm in (30):
  $λ_{1}^{q + 1} = λ_{1}^{q} (1 - γ [1 + \sum_{s = 1}^{n - 1} \sum_{k = 1}^{K (s)} \frac{exp (- λ_{1}^{q} \bar{d i s} (X_{(s)}^{(k)}) - λ_{2}^{q} \underset{̲}{d i s} (X_{(s)}^{(k)}))}{1 + exp (- λ_{1}^{q} \bar{d i s} (X_{(s)}^{(k)}) - λ_{2}^{q} \underset{̲}{d i s} (X_{(s)}^{(k)}))} \bar{d i s} (X_{(s)}^{(k)})]),$
  
  $λ_{2}^{q + 1} = λ_{2}^{q} (1 + γ [1 - \sum_{s = 1}^{n - 1} \sum_{k = 1}^{K (s)} \frac{exp (- λ_{1}^{q} \bar{d i s} (X_{(s)}^{(k)}) - λ_{2}^{q} \underset{̲}{d i s} (X_{(s)}^{(k)}))}{1 + exp (- λ_{1}^{q} \bar{d i s} (X_{(s)}^{(k)}) - λ_{2}^{q} \underset{̲}{d i s} (X_{(s)}^{(k)}))} \underset{̲}{d i s} (X_{(s)}^{(k)})]),$
  where
  $\bar{d i s} (X_{(s)}^{(k)}) = - \frac{d i s (X_{(s)}^{(k)})}{inf (X_{(s)}^{(k)})}, \underset{̲}{d i s} (X_{(s)}^{(k)}) = \frac{d i s (X_{(s)}^{(k)})}{sup (X_{(s)}^{(k)})} .$
- (c). Determining the optimal probability distribution:
  $p^{*} (k | s, λ_{1}^{*}, λ_{2}^{*}) = \frac{exp (- λ_{1}^{*} \bar{d i s} (X_{(s)}^{(k)}) - λ_{2}^{*} \underset{̲}{d i s} (X_{(s)}^{(k)}))}{1 + exp (- λ_{1}^{*} \bar{d i s} (X_{(s)}^{(k)}) - λ_{2}^{*} \underset{̲}{d i s} (X_{(s)}^{(k)}))} .$
- (d). Determining the most probable cluster $K_{1}^{*}$ :
  $k^{*} = arg max p^{*} (k | s, λ_{1}^{*}, λ_{2}^{*}), K_{1}^{*} = {i_{1}^{*}, \dots, i_{s}^{*}} \to k^{*} (s) .$
- (e). Determining the cluster $K_{2}^{*}$ :
  $K_{2}^{*} = {j_{1}^{*}, \dots, j_{(n - s)}^{*}}, (j_{1}^{*}, \dots, j_{(n - s)}^{*}) \neq (i_{1}^{*}, \dots, i_{s}^{*}) .$

5.1.2. Algorithm $R 2 K$ with an Unknown Cluster Size $s \in [1, (n - 1)]$

1.. Applying step 1 of $R 2 K (s)$
$ϱ (x^{(i)}, x^{(j)}), (i, j) = \bar{1, n}; d i s (X), inf (D), sup (D) .$
2.. Organizing a loop with respect to the cluster size $s = \bar{1, (n - 1)}$
- (a). Applying step 2 of $R 2 K (s)$
  $i_{1}, \dots, i_{s} \to k, i_{j} = \bar{1, n}, j = \bar{1, s}; k = \bar{1, K (s)}, K (s) = C_{s}^{n} .$
  
  $ϱ^{(k)} (x^{i_{h}}, x^{i_{q}}) = \sqrt{\sum_{(i_{k}, i_{q}) = 1}^{m} {(x^{i_{h}} - x^{i_{q}})}^{2}}, i_{k} \neq i_{q} .$
  
  $d i s (X_{(s)}^{(k)}) = \frac{2}{s (s - 1)} \sum_{i_{h}, i_{q}}^{s} ϱ_{i_{h}, i_{q}} .$
- (b). Applying step 3 of $R 2 K (s)$
  $p^{*} (s, k) = \frac{exp (- λ_{1}^{*} \bar{d i s} (X_{(s)}^{(k)}) - λ_{2}^{*} \underset{̲}{d i s} (X_{(s)}^{(k)}))}{1 + exp (- λ_{1}^{*} \bar{d i s} (X_{(s)}^{(k)}) - λ_{2}^{*} \underset{̲}{d i s} (X_{(s)}^{(k)}))} .$
- (c). Putting $p^{*} (s, k)$ into the memory.
- (d). Calculating the conditionally maximum value of the entropy:
  $H_{B}^{*} [p^{*} (s, k)] = - \sum_{k = 1}^{K (s)} p^{*} (s, k) ln p^{*} (s, k) = H^{*} (s) .$
- (e). Putting $H (s)$ in the memory.
- (f). If $s < n - 1$ , then returning to Step 2a.
- (g). Determining the maximum element of the array $H (s), s = \bar{1, (n - 1)}$ :
  $s^{*} = arg max_{1 \leq s \leq (n - 1)} H (s) .$
- (h). Extracting the probability distribution
  $p (s^{*}, k), k = \bar{1, K (s^{*})} .$
- (i). Executing Steps 3d and 3e of $R 2 K (s)$ :
  $k^{*} (s^{*}) = arg max p (s^{*}, k), K_{1}^{*} = {i_{1}^{*}, \dots, i_{s^{*}}^{*}} \to k^{*} (s^{*}) .$
  
  $K_{2}^{*} = {j_{1}, \dots, j_{n - s^{*}}}, (i_{1}^{*}, \dots, i_{s^{*}}^{*}) \neq (j_{1}, \dots, j_{n - s^{*}}) .$

6. Functional Problems (18)–(20)

Consider a parametric family of all constrained entropy maximization problems with the form

(33) $\begin{matrix} H [p (s, k), ε] = - \sum_{s = 1}^{n - 1} \sum_{k = 1}^{K (s)} [p (s, k) ln p (s, k) + (1 - p (s, k)) ln (1 - p (s, k))] \Rightarrow max, \\ \sum_{s = 1}^{n - 1} \sum_{k = 1}^{K (s)} p (s, k) d i s (X_{(s)}^{(k)}) = inf (D) + ε (sup (D) - inf (D)) = Δ (ε), \\ 0 \leq ε \leq 1 . \end{matrix}$

The solutions of (22) and (23) will coincide under an unknown value of the parameter

ε .

It can be determined by solving Equation (23) and fixing the values of the entropy functional. Its maximum value will correspond to the desired value

ε^{*} .

Let us turn to Equation (23) with a fixed value of the parameter $ε .$ It belongs to the class of Lyapunov-type problems [21]. We define a Lagrange functional as

(34) $L [p (s, k), ε, λ] = H [p (s, k), ε] + λ (Δ (ε) - \sum_{s = 1}^{n - 1} \sum_{k = 1}^{K (s)} p (s, k) d i s (X_{(s)}^{(k)})) .$

Using the technique of Gâteaux derivatives, we obtain the stationarity conditions for the functional in (24) in the primal (functional)

p (s, k)

and dual (scalar)

λ

variables; for details, see [22,23]. The resulting optimal distribution parameterized by the Lagrange multiplier

λ

is given by

(35) $p^{*} (s, k | λ (ε)) = \frac{exp (- λ d i s (X_{(s)}^{(k)}))}{1 + exp (- λ d i s (X_{(s)}^{(k)}))} .$

The Lagrange multiplier

λ

satisfies the equation

(36) $\sum_{s = 1}^{n - 1} \sum_{k = 1}^{K (s)} \frac{exp (- λ d i s (X_{(s)}^{(k)}))}{1 + exp (- λ d i s (X_{(s)}^{(k)}))} d i s (X_{(s)}^{(k)}) = Δ (ε) .$

The solution $λ^{*} (ε)$ of this equation belongs to $(- \infty, + \infty)$ and depends on $ε .$ Hence, the value of the entropy functional $H [p^{*} (s, k | ε), ε]$ depends on $ε .$ We choose

(37) $ε^{*} = arg max_{ε} H [p^{*} (s, k | ε), ε] .$

The randomized binary clustering procedure can be repeated $t / 2$ times to form t clusters. At each stage, two new clusters are generated from the remaining objects of the previous stage.

7. Illustrative Examples

Consider the binary clustering of iris flowers using Fisher’s Iris dataset (in this dataset, iris flowers are described by the petal width $x_{1}$ and the petal length $x_{2}$ ). The database contains this feature information for three types of flowers: “setosa” (1), “versicolor” (2), and “virginica” (3), in the amount of 50 two-dimensional points for each species. Below, we study types 1 and 2 and 10 data points for each type.

Example 1.

The data matrix contains the numerical values of the two features for types 1 and 2; see Table 1.

Figure 1 shows the arrangement of the data points on the plane.

First, we apply the algorithm $R 2 K (10);$ see Section 5.1.

The minimum and maximum elements are

(38) $inf (D_{(20 \times 20)}) = 0, sup (D_{(20 \times 20)}) = 3.73 .$

The data matrix indicator is $d i s X = 1.7382 .$ Let $ε = 0.15$ .

The ensemble of possible clusters has the size $K (10) = 184786$ . The cluster with number $k = 256$ has the form $i_{1} = 1, i_{2} = 2, i_{3} = 3, i_{4} = 4, i_{5} = 5, i_{6} = 6, i_{7} = 7, i_{8} = 14,$ $i_{9} = 15, i_{10} = 20$ . The distance matrix $D_{(10)}^{(256)}$ is presented in Table 2.

The indicator of the matrix $X_{(10)}^{(256)}$ corresponding to the cluster $K_{(10)}^{(256)}$ is

(39) $d i s (X_{(10)}^{(256)}) = 1.5021 .$

The indicators for the clusters $k = 1, \dots, 184786$ are shown in Figure 2.

The entropy-optimal probability distribution for $s = 10$ has the form

(40) $p^{*} (k | 10) = \frac{exp (- λ^{*} d i s (X_{(10)}^{(k)}))}{1 + exp (- λ^{*} d i s (X_{(10)}^{(k)}))}, λ^{*} = 12.1153 .$

The cluster $K_{1}$ with the maximum probability is numbered by

(41) $k^{*} = 166922, K_{1} = {4, 5, 6, 7, 11, 14, 15, 16, 17, 20}, d i s K_{1} = 0.1354 .$

The cluster $K_{2}$ consists of the following data points: ${1, 2, 3, 8, 9, 10, 12, 13, 18, 19}$ .

The arrangement of the clusters $K_{1}$ and $K_{2}$ is shown in Figure 3.

A direct comparison with Figure 1 indicates a perfect match of 10/10: no clustering errors.

Example 2.

Consider another data matrix from the same dataset (Table 3).

Figure 4 shows the arrangement of the data points.

Similar to Example 1, we apply the algorithm $R 2 K (10) .$

We construct the distance matrix $D_{(20 \times 20)}$ and find the minimum and maximum elements:

(42) $inf (D_{(20 \times 20)}) = 0, sup (D_{(20 \times 20)}) = 3.73 .$

Let $ε = 0.15$ .

The ensemble of possible clusters has the size $K (10) = 184786$ . The indicators for the clusters $k = 1, \dots, 184786$ are shown in Figure 5.

The entropy-optimal probability distribution for $s = 10$ has the form

(43) $p^{*} (k | 10) = \frac{exp (- λ^{*} d i s (X_{(10)}^{(k)}))}{1 + exp (- λ^{*} d i s (X_{(10)}^{(k)}))}, λ^{*} = 100 .$

The cluster $K_{1}$ with the maximum probability is numbered by

(44) $k^{*} = 177570, K_{1} = {5, 6, 7, 8, 11, 14, 15, 16, 17, 20}, d i s K_{1} = 0.4420 .$

It consists of the following data points: ${5, 6, 7, 8, 11, 14, 15, 16, 17, 20} .$ The cluster $K_{2}$ consists of the following data points: ${1, 2, 3, 4, 9, 10, 12, 13, 18, 19} .$

The arrangement of the clusters $K_{1}$ and $K_{2}$ is shown in Figure 6. A direct comparison with Figure 4 indicates a match of 8/10.

8. Discussion and Conclusions

This paper has developed a novel concept of clustering. Its fundamental difference from the conventional approaches is the generation of an ensemble of random clusters, accompanied by the matrices of inter-object distances averaged over the entire ensemble (the so-called indicators). Random clusters are parameterized by the number of objects s and their set $k \to {i_{1}, \dots, i_{s}}$ . Therefore, the ensemble’s characteristic is the probability distribution of the clusters in the ensemble, which depends on s and k. A generalized variational principle of statistical mechanics has been proposed to find this distribution. It consists of the conditional maximization of the Boltzmann–Shannon entropy. Algorithms for solving the finite-dimensional and functional optimization problems have been developed.

An advantage of the novel randomized clustering method is complete algorithmization, independent of the properties of the clustered objects data. All existing clustering methods involve, more or less, various empirical techniques related to data properties.

However, this method requires high computational resources to form an ensemble of random clusters and their indicators.

Author Contributions

Conceptualization, Y.S.P.; Data curation, A.Y.P.; Methodology, Y.S.P., A.Y.P., and Y.A.D.; Software, A.Y.P. and Y.A.D.; Supervision, Y.S.P.; Writing—original draft, Y.S.P., A.Y.P. and Y.A.D. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures and Tables

Figure 1. Data points on the two-dimensional plane.

Figure 2. Indicators for [Forumla omitted. See PDF.].

Figure 3. Randomized clustering results.

Figure 4. Data points on the two-dimensional plane.

Figure 5. Indicators for [Forumla omitted. See PDF.].

Figure 6. Randomized clustering results.

Table 1

Data matrix.

No.	$x_{1}$	$x_{2}$	Type
1	4.5	1.5	2
2	4.6	1.5	2
3	4.7	1.4	2
4	1.7	0.4	1
5	1.3	0.2	1
6	1.4	0.3	1
7	1.5	0.2	1
8	3.9	1.4	2
9	4.5	1.3	2
10	4.6	1.3	2
11	1.4	0.2	1
12	4.7	1.6	2
13	4.0	1.3	2
14	1.4	0.2	1
15	1.4	0.2	1
16	1.5	0.2	1
17	1.5	0.1	1
18	4.9	1.5	2
19	3.3	1.0	2
20	1.4	0.2	1

Table 2

Distance matrix for cluster $K (256)$ .

No.	1	2	3	4	5	6	7	8	9	10
1	0	0.1	0.22	3.01	3.45	3.32	3.27	3.36	3.36	3.36
2	0.1	0	0.14	3.1	3.55	3.42	3.36	3.45	3.45	3.45
3	0.22	0.14	0	3.16	3.61	3.48	3.42	3.51	3.51	3.51
4	3.01	3.1	3.16	0	0.45	0.32	0.28	0.36	0.36	0.36
5	3.45	3.55	3.61	0.45	0	0.14	0.2	0.1	0.1	0.1
6	3.32	3.42	3.48	0.32	0.14	0	0.14	0.1	0.1	0.1
7	3.27	3.36	3.42	0.28	0.2	0.14	0	0.1	0.1	0.1
8	3.36	3.45	3.51	0.36	0.1	0.1	0.1	0	0	0
9	3.36	3.45	3.51	0.36	0.1	0.1	0.1	0	0	0
10	3.36	3.45	3.51	0.36	0.1	0.1	0.1	0	0	0

Table 3

Data matrix.

No.	$x_{1}$	$x_{2}$	Type
1	6.4	3.2	2
2	6.5	2.8	2
3	7.0	3.2	2
4	5.4	3.9	1
5	4.7	3.2	1
6	4.6	3.4	1
7	4.6	3.1	1
8	5.2	2.7	2
9	5.7	2.8	2
10	6.6	2.9	2
11	5.1	3.5	1
12	6.3	3.3	2
13	5.5	2.3	2
14	4.4	2.9	1
15	4.9	3.0	1
16	5.0	3.4	1
17	4.9	3.1	1
18	6.9	3.1	2
19	4.9	2.4	2
20	5.0	3.6	1

References

1. Mandel, I.D. Klasternyi Analiz (Cluster Analysis); Finansy i Statistika: Moscow, Russia, 1988.

2. Zagoruiko, N.G. Kognitivnyi Analiz Dannykh (Cognitive Data Analysis); GEO: Novosibirsk, Russia, 2012.

3. Zagoruiko, N.G.; Barakhnin, V.B.; Borisova, I.A.; Tkachev, D.A. Clusterization of Text Documents from the Database of Publications Using FRiS-Tax Algorithm. Comput. Technol.; 2013; 18, pp. 62-74.

4. Jain, A.; Murty, M.; Flynn, P. Data Clustering: A Review. ACM Comput. Surv.; 1990; 31, pp. 264-323. [DOI: https://dx.doi.org/10.1145/331499.331504]

5. Vorontsov, K.V. Lektsii po Algoritmam Klasterizatsii i Mnogomernomu Shkalirovaniyu (Lectures on Clustering Algorithms and Multidimensional Scaling); Moscow State University: Moscow, Russia, 2007.

6. Lescovec, J.; Rajaraman, A.; Ullman, J. Mining of Massive Datasets; Cambridge University Press: Cambridge, UK, 2014.

7. Deerwester, S.; Dumias, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R. Indexing by Latent Semantic Analysis. J. Am. Soc. Inf. Sci.; 1999; 41, pp. 391-407. [DOI: https://dx.doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9]

8. Zamir, O.E. Clustering Web Documents: A Phrase-Based Method for Grouping Search Engine Results. Ph.D. Thesis; The Univeristy of Washington: Seattle, WA, USA, 1999.

9. Cao, G.; Song, D.; Bruza, P. Suffix-Tree Clustering on Post-retrieval Documents Information; The Univeristy of Queensland: Brisbane, QLD, Australia, 2003.

10. Huang, D.; Wang, C.D.; Lai, J.H.; Kwoh, C.K. Toward multidiversified ensemble clustering of high-dimensional data: From subspaces to metrics and beyond. IEEE Trans. Cybern.; 2021; [DOI: https://dx.doi.org/10.1109/TCYB.2021.3049633] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33961570]

11. Khan, I.; Luo, Z.; Shaikh, A.K.; Hedjam, R. Ensemble clustering using extended fuzzy k-means for cancer data analysis. Expert Syst. Appl.; 2021; 172, 114622. [DOI: https://dx.doi.org/10.1016/j.eswa.2021.114622]

12. Jain, A.; Dubs, R. Clustering Methods and Algorithms; Prentice-Hall: Hoboken, NJ, USA, 1988.

13. Pal, N.R.; Biswas, J. Cluster Validation Using Graph Theoretic Concept. Pattern Recognit.; 1997; 30, pp. 847-857. [DOI: https://dx.doi.org/10.1016/S0031-3203(96)00127-6]

14. Halkidi, M.; Batistakis, Y.; Vazirgiannis, M. On Clustering Validation Techniques. J. Intell. Inf. Syst.; 2001; 17, pp. 107-145. [DOI: https://dx.doi.org/10.1023/A:1012801612483]

15. Han, J.; Kamber, M.; Pei, J. Data Mining Concept and Techniques; Morgan Kaufmann Publishers: Burlington, MA, USA, 2012.

16. Popkov, Y.S. Randomization and Entropy in Machine Learning and Data Processing. Dokl. Math.; 2022; 105, pp. 135-157. [DOI: https://dx.doi.org/10.1134/S1064562422030073]

17. Popkov, Y.S.; Dubnov, Y.A.; Popkov, A.Y. Introduction to the Theory of Randomized Machine Learning. Learning Systems: From Theory to Practice; Sgurev, V.; Piuri, V.; Jotsov, V. Springer International Publishing: Cham, Switzerland, 2018; pp. 199-220. [DOI: https://dx.doi.org/10.1007/978-3-319-75181-8_10]

18. Popkov, Y.S. Macrosystems Theory and Its Applications (Lecture Notes in Control and Information Sciences Vol 203); Springer: Berlin, Germany, 1995.

19. Popkov, Y.S. Multiplicative Methods for Entropy Programming Problems and their Applications. Proceedings of the 2010 IEEE International Conference on Industrial Engineering and Engineering Management; Xiamen, China, 29–31 October 2010; pp. 1358-1362. [DOI: https://dx.doi.org/10.1109/IEEM.2010.5674404]

20. Polyak, B.T. Introduction to Optimization; Optimization Software: New York, NY, USA, 1987.

21. Joffe, A.D.; Tihomirov, A.M. Teoriya Ekstremalnykh Zadach (Theory of Extreme Problems); Nauka: Moscow, Russia, 1974.

22. Tihomirov, V.M.; Alekseev, V.N.; Fomin, S.V. Optimal Control; Nauka: Moscow, Russia, 1979.

23. Popkov, Y.; Popkov, A. New methods of entropy-robust estimation for randomized models under limited data. Entropy; 2014; 16, pp. 675-698. [DOI: https://dx.doi.org/10.3390/e16020675]

Word count: 3532

Show less

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

This paper proposes a clustering method based on a randomized representation of an ensemble of possible clusters with a probability distribution. The concept of a cluster indicator is introduced as the average distance between the objects included in the cluster. The indicators averaged over the entire ensemble are considered the latter’s characteristics. The optimal distribution of clusters is determined using the randomized machine learning approach: an entropy functional is maximized with respect to the probability distribution subject to constraints imposed on the averaged indicator of the cluster ensemble. The resulting entropy-optimal cluster corresponds to the maximum of the optimal probability distribution. This method is developed for binary clustering as a basic procedure. Its extension to t-ary clustering is considered. Some illustrative examples of entropy-randomized clustering are given.

Details

Title

Entropy-Randomized Clustering

Author

Popkov, Yuri S; Dubnov, Yuri A; Alexey Yu Popkov

First page

3710

Publication year

2022

Publication date

2022

Publisher

MDPI AG

e-ISSN

22277390

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/math10193710

ProQuest document ID

2724262813

Entropy-Randomized Clustering

Jump to:

Full text

Abstract

Details

Suggested sources

No.	$x_{1}$	$x_{2}$	Type
1	4.5	1.5	2
2	4.6	1.5	2
3	4.7	1.4	2
4	1.7	0.4	1
5	1.3	0.2	1
6	1.4	0.3	1
7	1.5	0.2	1
8	3.9	1.4	2
9	4.5	1.3	2
10	4.6	1.3	2
11	1.4	0.2	1
12	4.7	1.6	2
13	4.0	1.3	2
14	1.4	0.2	1
15	1.4	0.2	1
16	1.5	0.2	1
17	1.5	0.1	1
18	4.9	1.5	2
19	3.3	1.0	2
20	1.4	0.2	1

No.	$x_{1}$	$x_{2}$	Type
1	6.4	3.2	2
2	6.5	2.8	2
3	7.0	3.2	2
4	5.4	3.9	1
5	4.7	3.2	1
6	4.6	3.4	1
7	4.6	3.1	1
8	5.2	2.7	2
9	5.7	2.8	2
10	6.6	2.9	2
11	5.1	3.5	1
12	6.3	3.3	2
13	5.5	2.3	2
14	4.4	2.9	1
15	4.9	3.0	1
16	5.0	3.4	1
17	4.9	3.1	1
18	6.9	3.1	2
19	4.9	2.4	2
20	5.0	3.6	1

No.	$x_{1}$	$x_{2}$	Type
1	4.5	1.5	2
2	4.6	1.5	2
3	4.7	1.4	2
4	1.7	0.4	1
5	1.3	0.2	1
6	1.4	0.3	1
7	1.5	0.2	1
8	3.9	1.4	2
9	4.5	1.3	2
10	4.6	1.3	2
11	1.4	0.2	1
12	4.7	1.6	2
13	4.0	1.3	2
14	1.4	0.2	1
15	1.4	0.2	1
16	1.5	0.2	1
17	1.5	0.1	1
18	4.9	1.5	2
19	3.3	1.0	2
20	1.4	0.2	1

No.	$x_{1}$	$x_{2}$	Type
1	6.4	3.2	2
2	6.5	2.8	2
3	7.0	3.2	2
4	5.4	3.9	1
5	4.7	3.2	1
6	4.6	3.4	1
7	4.6	3.1	1
8	5.2	2.7	2
9	5.7	2.8	2
10	6.6	2.9	2
11	5.1	3.5	1
12	6.3	3.3	2
13	5.5	2.3	2
14	4.4	2.9	1
15	4.9	3.0	1
16	5.0	3.4	1
17	4.9	3.1	1
18	6.9	3.1	2
19	4.9	2.4	2
20	5.0	3.6	1

No.	$x_{1}$	$x_{2}$	Type
1	4.5	1.5	2
2	4.6	1.5	2
3	4.7	1.4	2
4	1.7	0.4	1
5	1.3	0.2	1
6	1.4	0.3	1
7	1.5	0.2	1
8	3.9	1.4	2
9	4.5	1.3	2
10	4.6	1.3	2
11	1.4	0.2	1
12	4.7	1.6	2
13	4.0	1.3	2
14	1.4	0.2	1
15	1.4	0.2	1
16	1.5	0.2	1
17	1.5	0.1	1
18	4.9	1.5	2
19	3.3	1.0	2
20	1.4	0.2	1

No.	$x_{1}$	$x_{2}$	Type
1	6.4	3.2	2
2	6.5	2.8	2
3	7.0	3.2	2
4	5.4	3.9	1
5	4.7	3.2	1
6	4.6	3.4	1
7	4.6	3.1	1
8	5.2	2.7	2
9	5.7	2.8	2
10	6.6	2.9	2
11	5.1	3.5	1
12	6.3	3.3	2
13	5.5	2.3	2
14	4.4	2.9	1
15	4.9	3.0	1
16	5.0	3.4	1
17	4.9	3.1	1
18	6.9	3.1	2
19	4.9	2.4	2
20	5.0	3.6	1