This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Clustering is a family of unsupervised algorithms that plays an important role in machine learning and data mining. Traditional clustering algorithms can be divided into five categories [1–3]: partitioning, hierarchical, density-based, grid-based, and model-based. K-means is one of the most popular and simplest partition-based clustering algorithms and has been widely used in many fields [4, 5]. However, the K-means algorithm has its own drawbacks, and many scholars have proposed improvements [6], mainly in two directions [7–12]: the selection of the initial cluster centers and the determination of the number of clusters K.
In 2014, Rodriguez and Laio proposed the density peak clustering (DPC) algorithm [13]. It assumes that cluster centers are surrounded by neighbors with lower local density and are relatively far away from any point with a higher local density. The DPC algorithm is of great significance for selecting initial cluster centers and determining the K-value, but it also has limitations [14–16]. One strategy [17] automatically determines the cutoff distance by information entropy, based on the potential energy in the data field instead of the local density between data points. Another approach [18] determines the initial cluster centers with an improved DPC algorithm and uses entropy to weight the Euclidean distance between data points, thereby optimizing the K-means algorithm. The DPCSA algorithm [19] improves DPC with the K-nearest neighbor (KNN) algorithm and a weighted local density sequence. In [20], a fuzzy weighted KNN is used to calculate the local density of data points, and two assignment strategies are employed to classify data points, which improves the robustness of the clustering. The DPC-DLP algorithm [21] uses the KNN idea to compute the global cutoff distance and the local density and assigns data points to clusters with a graph-based label propagation algorithm. The PageRank algorithm can be used to calculate the local density [22], which avoids the instability of the clustering result caused by the cutoff distance. The CFSFDP-HD algorithm [23] uses heat diffusion to calculate the local density and thus reduces the influence of the cutoff distance on the clustering result. The ADPC-KNN algorithm [24] calculates the local density based on KNN and automatically selects the initial cluster centers. The DPADN algorithm [25] redesigns the local density with a continuous function, which allows the cluster centers to be determined automatically. A multi-feature fusion model with adaptive graph learning [26] is developed as an unsupervised algorithm for person reidentification. An adaptive approach [27] simultaneously learns the affinity graph and the feature fusion, resulting in better clustering. The RCFE method [28] guarantees that the number of clusters converges to the ground truth via a rank constraint on the Laplacian matrix. Finally, there is other recent work involving density peaks and nearest neighbors that is worth investigating [29–36].
Inspired by the ideas of DPC and KNN, this paper proposes an improved clustering algorithm. First, the sum of the squared errors (SSE) is used to determine the optimal number of clusters K; then K initial cluster centers are selected based on an improved DPC algorithm. Next, the K-means algorithm is applied iteratively, and the data points in each cluster are divided into core points and outliers according to the average distance within the cluster. Finally, following the nearest-neighbor idea of KNN, the outliers are assigned by voting. Experimental results show that, compared with several known clustering algorithms, the proposed algorithm achieves a better clustering effect in most cases.
The remainder of this paper is organized as follows. Section 2 reviews the DPC algorithm and introduces several improved methods. Section 3 provides our K-value selection method. Section 4 describes the proposed algorithm in detail. Experimental results are presented and discussed in Section 5. The conclusion is stated in Section 6.
2. Related Work
2.1. The DPC Algorithm
The DPC algorithm assumes that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from data points with higher densities. In this approach, for each data point $x_i$, two quantities are computed: its local density $\rho_i$ and its distance $\delta_i$ from points with higher local density.
There are two ways to calculate the local density $\rho_i$: the cutoff kernel, $\rho_i=\sum_{j\neq i}\chi(d_{ij}-d_c)$ with $\chi(x)=1$ if $x<0$ and $\chi(x)=0$ otherwise, and the Gaussian kernel, $\rho_i=\sum_{j\neq i}\exp\left(-d_{ij}^2/d_c^2\right)$, where $d_{ij}$ is the distance between points $x_i$ and $x_j$ and $d_c$ is the cutoff distance.
Comparing the two calculation methods, the local density obtained with the cutoff kernel is a discrete (integer) value, whereas the Gaussian kernel yields a continuous value. The cutoff kernel may therefore produce identical local densities for different data points, causing conflicts; for this reason, the Gaussian kernel is more commonly used.
The distance $\delta_i$ is defined as the minimum distance from point $x_i$ to any point with a higher local density, $\delta_i=\min_{j:\rho_j>\rho_i}d_{ij}$. For the point with the highest density, it conventionally takes $\delta_i=\max_j d_{ij}$.
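For concreteness, the following sketch (not the authors' code; variable and function names are illustrative) computes the two DPC quantities just defined, using the Gaussian kernel for the local density:

```python
# Minimal sketch of the standard DPC quantities: the Gaussian-kernel local
# density rho_i and the distance delta_i to the nearest denser point.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dpc_quantities(X, dc):
    """Return (rho, delta) for data matrix X (n x m) and cutoff distance dc."""
    D = squareform(pdist(X))                      # pairwise distances d_ij
    rho = np.exp(-(D / dc) ** 2).sum(axis=1) - 1  # Gaussian kernel; drop self-term
    delta = np.zeros(len(X))
    order = np.argsort(-rho)                      # indices sorted by decreasing density
    delta[order[0]] = D[order[0]].max()           # densest point: max distance to others
    for k in range(1, len(order)):
        i = order[k]
        delta[i] = D[i, order[:k]].min()          # min distance to any denser point
    return rho, delta
```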
Assume that the original data set is shown in Figure 1. The local density $\rho_i$ and the distance $\delta_i$ of each data point are calculated according to the definitions above.
[figure(s) omitted; refer to PDF]
In order to determine the cluster centers and their number clearly, the quantity $\gamma_i=\rho_i\delta_i$ is defined; the larger $\gamma_i$ is, the more likely point $x_i$ is to be a cluster center. According to the descending order of $\gamma_i$, the data points with distinctly large values are selected as the cluster centers, as illustrated in the following figure.
[figure(s) omitted; refer to PDF]
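Continuing the sketch above, a small illustrative helper can rank points by $\gamma_i=\rho_i\delta_i$ and return the indices of the K candidate centers; in the proposed algorithm K is not read from the decision graph but supplied by the K-value selection method of Section 3.

```python
# Illustrative helper: rank points by gamma = rho * delta and return the K
# indices most likely to be cluster centers.
import numpy as np

def select_centers(rho, delta, K):
    gamma = rho * delta
    return np.argsort(-gamma)[:K]   # indices of the K largest gamma values
```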
2.2. Set the Cutoff Distance
It is not advisable to set the cutoff distance $d_c$ manually, because an inappropriate $d_c$ directly changes the local densities and therefore the clustering result; different data sets generally require different cutoff distances.
Assuming $p$ is a percentage, we define the cutoff distance as $d_c=\mathrm{dis}\left(\lceil p\cdot N\rceil\right)$, where $N$ is the number of distances between any two data points and $\mathrm{dis}$ is the list of these distances sorted in ascending order; that is, the value at position $\lceil p\cdot N\rceil$ of the sorted list is taken as $d_c$.
[figure(s) omitted; refer to PDF]
We can see that the cutoff distance $d_c$ obtained in this way depends strongly on the chosen percentage $p$, and an unsuitable $p$ leads to an unsuitable $d_c$ and hence a poor clustering result. For this reason, information entropy is introduced to determine $d_c$ automatically.
In information theory, Shannon entropy is used to measure the uncertainty of a system: the greater the entropy, the greater the uncertainty [37]. Similarly, the uncertainty of the data distribution can be expressed by entropy. Suppose that the local density of data point $x_i$ is $\rho_i$ $(i=1,2,\ldots,n)$; the information entropy of the data set is defined as $H=-\sum_{i=1}^{n}(\rho_i/Z)\log(\rho_i/Z)$, where $Z=\sum_{i=1}^{n}\rho_i$ is the normalization factor. The relationship between the information entropy $H$ and the impact factor (here, the cutoff distance $d_c$) is illustrated in the following figure; the $d_c$ that minimizes $H$ is taken as the optimal cutoff distance.
[figure(s) omitted; refer to PDF]
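As a hedged sketch of this entropy-guided search (the candidate grid and the exact criterion used in the paper may differ), one can scan candidate cutoff distances taken from percentiles of the pairwise distances and keep the one whose normalized Gaussian-kernel densities have the smallest entropy:

```python
# Sketch: pick the cutoff distance d_c that minimizes the entropy of the
# normalized local densities. Function names and the percentile grid are
# illustrative assumptions, not the authors' implementation.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def density_entropy(rho, eps=1e-12):
    p = (rho + eps) / (rho + eps).sum()        # normalize by Z = sum(rho)
    return float(-(p * np.log(p)).sum())       # H = -sum_i p_i log p_i

def choose_dc_by_entropy(X, percentiles=np.arange(0.5, 10.5, 0.5)):
    dists = pdist(X)                           # all pairwise distances
    D = squareform(dists)
    best_dc, best_H = None, np.inf
    for q in percentiles:
        dc = np.percentile(dists, q)           # candidate cutoff distance
        if dc <= 0:
            continue
        rho = np.exp(-(D / dc) ** 2).sum(axis=1) - 1.0   # Gaussian-kernel density
        H = density_entropy(rho)
        if H < best_H:
            best_dc, best_H = dc, H
    return best_dc
```

Both a very small and a very large $d_c$ make all densities nearly equal (maximum entropy), so the entropy typically dips at an intermediate value, which is the one returned.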
2.3. Optimize the Local Density
The DPC algorithm only considers the global structure of the data set, and its clustering result is not good enough on unevenly distributed data sets [41]. When the data are relatively concentrated and the densities of the clusters differ greatly, changes in local density make it difficult to select the correct cluster centers, which affects the final clustering result. Based on the idea of the KNN algorithm, the nearest neighbors are therefore introduced into the local density calculation: the closer a data point is to the target point, the more it contributes to the target point's local density. The new local density of a point is accordingly computed only from its K nearest neighbors, with each neighbor's contribution decreasing as its distance increases, as sketched below.
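A minimal sketch of such a K-nearest-neighbor local density follows; the exponential weighting of neighbor distances is an assumption for illustration and is not necessarily the exact kernel used in the paper.

```python
# Sketch of a KNN-based local density: only the k nearest neighbors contribute,
# and closer neighbors contribute more (here via exp(-distance)).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_local_density(X, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own neighbor
    dist, _ = nn.kneighbors(X)                        # dist[:, 0] == 0 (the point itself)
    return np.exp(-dist[:, 1:]).sum(axis=1)           # sum of exp(-d_ij) over the k neighbors
```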
3. The K-Value Selection Method
In the DPC algorithm, the boundary between cluster center points and non-center points may be unclear, which causes different users to choose different numbers of clusters according to their own experience. Similarly, in the K-means algorithm, the number of clusters also needs to be preset by the user based on experience. However, it is often difficult to determine the value of K.
To solve the problem that the optimal number of clusters is difficult to choose, we propose a K-value selection method based on the SSE, following the ET-SSE algorithm in [42].
Ordinarily, the K-means algorithm employs the SSE to measure clustering quality, $\mathrm{SSE}=\sum_{k=1}^{K}\sum_{x\in C_k}\lVert x-c_k\rVert^2$, where $C_k$ is the $k$-th cluster and $c_k$ is its center. As K increases, the SSE decreases monotonically, and the "elbow point" of the SSE curve, where the decrease suddenly slows down, indicates a suitable number of clusters.
[figure(s) omitted; refer to PDF]
To address the problem that the "elbow point" in Figure 6(b) is not clear, the exponential function is applied to transform the SSE so that the difference between adjacent K values is amplified and the elbow becomes more distinct; the K at this sharpened elbow is chosen as the optimal number of clusters.
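The sketch below illustrates the K-value selection idea under stated assumptions: the SSE is taken from scikit-learn's KMeans inertia, and the elbow is located as the K with the largest change of slope. The exact exponential transform of the ET-SSE method [42] is not reproduced here.

```python
# Sketch: compute the SSE curve for K = 1..k_max and pick the "elbow"
# (the K where the decrease in SSE slows down the most).
import numpy as np
from sklearn.cluster import KMeans

def sse_curve(X, k_max=10, seed=0):
    sse = []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        sse.append(km.inertia_)          # inertia_ is exactly the within-cluster SSE
    return np.array(sse)

def elbow_k(sse):
    drops = np.diff(sse)                 # drops[i] = SSE(K=i+2) - SSE(K=i+1), negative
    second = np.diff(drops)              # change of slope between consecutive K
    return int(np.argmax(second)) + 2    # +2 maps the index back to a K value
```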
4. The Improved Clustering Algorithm
Aiming at the unstable clustering results caused by randomly selecting the initial cluster centers in the traditional K-means algorithm, this paper proposes an improved clustering algorithm based on the density peak and nearest neighbors. First, information entropy is used to improve the DPC algorithm: the optimal cutoff distance of the data set is found, and the local densities of the data points are calculated. Next, the optimal number of clusters is obtained by the K-value selection method proposed in Section 3. Finally, the data points are sorted in descending order of the values in (5), the top K points are selected as the initial cluster centers, and the K-means algorithm is used for iterative clustering.
4.1. Weighted Euclidean Distance
Let the data set be $X=\{x_1,x_2,\ldots,x_n\}$, where each data point $x_i=(x_{i1},x_{i2},\ldots,x_{im})$ has $m$ attributes.
In order to remove the unit restrictions of the different attributes in the original data and to avoid their impact on the clustering results, the original data need to be normalized and converted into pure numerical values. After normalization, each attribute is on the same order of magnitude, which makes a comprehensive comparative evaluation possible. A standard min-max normalization is used: $x'_{il}=\left(x_{il}-\min_{1\le j\le n}x_{jl}\right)/\left(\max_{1\le j\le n}x_{jl}-\min_{1\le j\le n}x_{jl}\right)$.
The traditional K-means algorithm employs the Euclidean distance to measure the similarity between data points. It treats all attributes uniformly and regards the differences between different attributes as equally important. In practice, however, different attributes contribute quite differently to the clustering result. To solve this problem, a weighted Euclidean distance is used to measure the similarity between data points. Let the weight of the $l$-th attribute be $w_l$; the weighted Euclidean distance between data points $x_i$ and $x_j$ is then $d_w(x_i,x_j)=\sqrt{\sum_{l=1}^{m}w_l\,(x_{il}-x_{jl})^2}$.
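A brief sketch of the normalization and the weighted distance is given below. The min-max scaling and the uniform placeholder weights are assumptions for illustration; the paper's own attribute-weight formula is not reproduced here.

```python
# Sketch: min-max normalization and a weighted Euclidean distance.
import numpy as np

def min_max_normalize(X):
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)   # guard constant attributes

def weighted_euclidean(a, b, w):
    return np.sqrt(np.sum(w * (a - b) ** 2))            # sqrt(sum_l w_l (a_l - b_l)^2)

# usage with uniform placeholder weights:
# Xn = min_max_normalize(X)
# d = weighted_euclidean(Xn[0], Xn[1], np.full(X.shape[1], 1.0 / X.shape[1]))
```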
4.2. The Framework of the Proposed Algorithm
The K-means algorithm is sensitive to outliers [43], and the improved DPC algorithm can exclude the influence of outliers on the selection of initial cluster centers. However, K-means is an iterative clustering algorithm, and each iteration will generate new cluster centers. The outliers will have an impact on the generation of new cluster centers. Hence, it is necessary to distinguish the outliers in each cluster.
Let $\mathrm{MeanDist}_k$ denote the average distance from the data points in cluster $C_k$ to its cluster center $c_k$, that is, $\mathrm{MeanDist}_k=\frac{1}{|C_k|}\sum_{x\in C_k}d(x,c_k)$. A data point whose distance to its cluster center does not exceed $\mathrm{MeanDist}_k$ is regarded as a core point of the cluster; otherwise, it is regarded as an outlier.
To ensure that the assignment of outliers is closer to the real situation and to improve the clustering accuracy, the idea of KNN is introduced into the assignment of outliers [44–46]. Suppose that $x$ is an outlier; its K nearest neighbors among the core points are found, and $x$ is assigned by voting to the cluster that contains the majority of these neighbors.
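The outlier handling just described can be sketched as follows (an illustrative implementation, not the authors' code); it assumes integer cluster labels $0,\ldots,K-1$ and non-empty clusters.

```python
# Sketch: split each cluster into core points and outliers by the MeanDist rule,
# recompute centers from the core points only, and re-assign each outlier by a
# majority vote among its k nearest core points.
import numpy as np
from scipy.spatial.distance import cdist

def split_and_reassign(X, labels, centers, k=5):
    n_clusters = centers.shape[0]
    d_to_center = np.linalg.norm(X - centers[labels], axis=1)
    core = np.zeros(len(X), dtype=bool)
    for c in range(n_clusters):
        in_c = labels == c
        core[in_c] = d_to_center[in_c] <= d_to_center[in_c].mean()   # MeanDist rule
    # new centers from core points only (assumes every cluster is non-empty)
    new_centers = np.vstack([X[core & (labels == c)].mean(axis=0)
                             for c in range(n_clusters)])
    # re-assign each outlier by voting among its k nearest core points
    new_labels = labels.copy()
    outliers = np.where(~core)[0]
    if len(outliers):
        D = cdist(X[outliers], X[core])
        core_labels = labels[core]
        for row, i in enumerate(outliers):
            nearest = np.argsort(D[row])[:k]
            votes = np.bincount(core_labels[nearest], minlength=n_clusters)
            new_labels[i] = votes.argmax()
    return new_labels, new_centers, core
```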
In summary, the steps of our proposed algorithm are as follows (a compact sketch of the whole procedure is given after the step list):
(1) Input the original data set $X=\{x_1,x_2,\ldots,x_n\}$.
(2) Normalize the data set X according to (15) to obtain the processed data set $X'$.
(3) According to (17), the weighted Euclidean distance between each pair of data points in $X'$ is calculated to obtain the distance matrix.
(4) Calculate the optimal cutoff distance $d_c$ by minimizing the information entropy, and then calculate the local density $\rho_i$ of each data point.
(5) Calculate the distance $\delta_i$ of each data point and the quantity $\gamma_i=\rho_i\delta_i$, and sort the data points in descending order of $\gamma_i$.
(6) Determine the number of clusters K according to the K-value selection method presented in Section 3.
(7) The first K data points of the sequence sorted in descending order of $\gamma$ are selected as the initial cluster centers.
(8) The distances from each data point to each cluster center are calculated, and a data point is classified into the cluster with the nearest cluster center.
(9) Calculate the mean distance MeanDist of each cluster according to (18). The data points in each cluster are divided into core points and outliers.
(10) Calculate the average value of the core points in each cluster as the new cluster centers of the next iteration. The idea of nearest neighbors in KNN algorithm is used to re-assign the outliers.
(11) When the cluster centers no longer change, the algorithm terminates and outputs the clustering result. Otherwise, go to Step 8.
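The sketch below ties the illustrative helpers from the previous subsections together into one driver following steps 1-11; it reuses min_max_normalize, dpc_quantities, select_centers, and split_and_reassign defined earlier and is an assumption-laden illustration, not the authors' implementation.

```python
# Compact driver for the overall procedure (steps 1-11), reusing the earlier
# illustrative helpers; dc comes from the entropy search of Section 2.2 and K
# from the K-value selection method of Section 3.
import numpy as np
from scipy.spatial.distance import cdist

def improved_clustering(X, K, dc, knn_k=5, max_iter=100):
    Xn = min_max_normalize(X)                       # step 2
    rho, delta = dpc_quantities(Xn, dc)             # steps 4-5
    centers = Xn[select_centers(rho, delta, K)].copy()   # step 7: top-K gamma values
    for _ in range(max_iter):
        labels = cdist(Xn, centers).argmin(axis=1)  # step 8: nearest-center assignment
        labels, new_centers, _ = split_and_reassign(Xn, labels, centers, k=knn_k)  # steps 9-10
        if np.allclose(new_centers, centers):       # step 11: stop when centers are stable
            break
        centers = new_centers
    return labels, centers
```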
Furthermore, a flowchart for the overall process is provided in Figure 7.
[figure(s) omitted; refer to PDF]
4.3. The Time Complexity Analysis
The time complexity of the improved DPC algorithm mainly depends on the calculation of the local density $\rho_i$ and the distance $\delta_i$ of each data point, both of which require the distances between all pairs of data points; its time complexity is therefore $O(n^2)$, where $n$ is the number of data points, and this term dominates the overall complexity of the proposed algorithm.
5. Experiments
5.1. Experimental Environment and the Data Sets
The experiments are run on Windows 10 Professional (64-bit) with an Intel Core i3-4000M CPU at 2.40 GHz and 4 GB of memory. The proposed algorithm is implemented in MATLAB R2011a. The Wine, Pima, WDBC, Iris, and Parkinsons data sets from the UCI repository of real data sets [47] are used as the experimental data sets, as shown in Table 1.
Table 1
The description of experimental data sets.
Data set | Samples | Attributes | Clusters |
Wine | 178 | 13 | 3 |
Pima | 768 | 8 | 2 |
WDBC | 569 | 30 | 2 |
Iris | 150 | 4 | 3 |
Parkinsons | 195 | 22 | 2 |
5.2. The Evaluation Measures
This paper uses accuracy (ACC), the adjusted Rand index (ARI), and adjusted mutual information (AMI) to evaluate the performance of the clustering algorithms. Assume that $U$ is the ground-truth partition of a data set and $V$ is the partition produced by a clustering algorithm; ACC is the proportion of data points assigned to the correct cluster under the best one-to-one matching between $U$ and $V$, while ARI and AMI measure the chance-corrected agreement between $U$ and $V$ [45]. For all three measures, larger values indicate better clustering results.
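As a reference for how these measures can be computed, the sketch below uses scikit-learn for ARI and AMI and the Hungarian algorithm for ACC (clustering accuracy requires the best one-to-one matching between predicted and true labels); integer-coded labels are assumed, and the function names are illustrative.

```python
# Sketch: ACC via Hungarian matching of the contingency matrix, ARI and AMI
# from scikit-learn.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

def clustering_acc(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                          # contingency matrix
    row, col = linear_sum_assignment(-cost)      # maximize correctly matched pairs
    return cost[row, col].sum() / len(y_true)

def evaluate(y_true, y_pred):
    return {"ACC": clustering_acc(y_true, y_pred),
            "ARI": adjusted_rand_score(y_true, y_pred),
            "AMI": adjusted_mutual_info_score(y_true, y_pred)}
```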
5.3. Experimental Results and Discussion
In this paper, we compare our proposed algorithm with several existing algorithms: K-means, DPC [13], CNACS-K-means [50], and DCC-K-means [51]. Because of the influence of the randomly selected initial cluster centers, the K-means algorithm is run 20 times and the mean values are reported. The experimental results on the UCI data sets are shown in Tables 2–4, where the values in bold are the best results on each data set.
Table 2
The comparison of ACC for the five algorithms.
Data set | Our algorithm | DPC | K-means | CNACS-K-means | DCC-K-means |
Wine | 0.9326 | 0.5337 | 0.8972 | 0.9157 | 0.9270 |
Pima | 0.6654 | 0.6458 | 0.6678 | 0.4128 | 0.6680 |
WDBC | 0.9279 | 0.7733 | 0.9332 | 0.7381 | 0.9332 |
Iris | 0.8933 | 0.6667 | 0.8287 | 0.6667 | 0.6667 |
Parkinsons | 0.7026 | 0.7436 | 0.6972 | 0.7333 | 0.6769 |
Table 3
The comparison of ARI for the five algorithms.
Data set | Our algorithm | DPC | K-means | CNACS-K-means | DCC-K-means |
Wine | 0.8025 | 0.2534 | 0.7605 | 0.757 | 0.7882 |
Pima | 0.1083 | 0.1072 | 0.1022 | 0.1199 | 0.1024 |
WDBC | 0.7313 | 0.2745 | 0.7483 | 0.6029 | 0.7483 |
Iris | 0.7312 | 0.5681 | 0.7217 | 0.5681 | 0.5399 |
Parkinsons | 0.1597 | 0.1599 | 0.1472 | 0.1642 | 0.1187 |
Table 4
The comparison of AMI for the five algorithms.
Data set | Our algorithm | DPC | K-means | CNACS-K-means | DCC-K-means |
Wine | 0.7898 | 0.3272 | 0.7480 | 0.7335 | 0.7785 |
Pima | 0.8912 | 0.4537 | 0.5012 | 0.6846 | 0.5026 |
WDBC | 0.6123 | 0.2505 | 0.6387 | 0.3989 | 0.6387 |
Iris | 0.7578 | 0.5768 | 0.3648 | 0.5768 | 0.5194 |
Parkinsons | 0.1358 | 0.0547 | 0.2010 | 0.0648 | 0.2127 |
As shown in Tables 2–4, the algorithm we proposed is generally better than the other four comparison algorithms in terms of the three evaluation measures. It achieves the best experimental results especially on the Wine and Iris data sets.
In the comparison of ACC in Table 2, our algorithm is slightly worse than the K-means and DCC-K-means algorithms on the Pima and WDBC data sets but better than the other two comparison algorithms. On the Parkinsons data set the situation is reversed: it outperforms K-means and DCC-K-means but falls behind DPC and CNACS-K-means.
As for the ARI comparison in Table 3, the proposed algorithm is lower than CNACS-K-means on the Pima data set but higher than the other three comparison algorithms. On the WDBC data set, it is only 0.017 lower than the K-means and DCC-K-means algorithms. On the Parkinsons data set, it is lower than CNACS-K-means and only 0.0002 below DPC, but higher than the other two comparison algorithms.
In Table 4, the comparison of AMI shows that our algorithm is superior to the other four comparison algorithms on the Pima data set. On the WDBC data set, it is close to the K-means and DCC-K-means algorithms, and superior to the other two comparison algorithms. Although our algorithm outperforms the DPC and CNACS-K-means algorithms, it is indeed inferior to the other two comparison algorithms on the Parkinsons data set.
6. Conclusion
Focusing on the problems of randomly selected initial cluster centers, the manual determination of the number of clusters, and the neglect of outliers during the clustering process, this paper proposes an improved clustering algorithm based on the density peak and the nearest neighbors. Our algorithm uses an improved DPC algorithm to determine the initial cluster centers and uses the within-cluster sum of squared errors to find the optimal cluster number K. Moreover, the average distance within each cluster and the nearest-neighbor idea are combined to identify the outliers and determine their assignment. The experimental results show that the proposed algorithm achieves better clustering results on the UCI real data sets. However, it is not yet clear whether the algorithm scales to large data sets; improving its stability and running efficiency on large-scale data will be the focus of future work.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61902262) and Handan City Science and Technology Research and Development Program (19422031008-15).
[1] A. Saxena, M. Prasad, A. Gupta, N. Bharill, O. P. Patel, A. Tiwari, M. J. Er, W. P. Ding, C. T. Lin, "A review of clustering techniques and developments," Neurocomputing, vol. 267, pp. 664-681, DOI: 10.1016/j.neucom.2017.06.053, 2017.
[2] G. M. Mazzeo, E. Masciari, C. Zaniolo, "A fast and accurate algorithm for unsupervised clustering around centroids," Information Sciences, vol. 400-401, pp. 63-90, DOI: 10.1016/j.ins.2017.03.002, 2017.
[3] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, A. Khalil, A. Y. Zomaya, S. Foufou, A. Bouras, "A survey of clustering algorithms for big data: taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2 no. 3, pp. 267-279, DOI: 10.1109/tetc.2014.2330519, 2014.
[4] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31 no. 8, pp. 651-666, DOI: 10.1016/j.patrec.2009.09.011, 2010.
[5] J. MacQueen, "Some methods for classification and analysis of multivariate observations," Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297, 1967.
[6] J. C. Yang, C. Zhao, "Survey on K-Means clustering algorithm," Computer Engineering and Applications, vol. 55 no. 23, 2019.
[7] H. Ismkhan, "I-k-means−+: an iterative clustering algorithm based on an enhanced version of the k-means," Pattern Recognition, vol. 79, pp. 402-413, DOI: 10.1016/j.patcog.2018.02.015, 2018.
[8] J. S. Lei, T. Jiang, K. Wu, H. Z. Du, G. K. Zhu, Z. Q. Wang, "Robust K-means algorithm with automatically splitting and merging clusters and its applications for surveillance data," Multimedia Tools and Applications, vol. 75 no. 19, pp. 12043-12059, DOI: 10.1007/s11042-016-3322-5, 2016.
[9] G. Tzortzis, A. Likas, "The MinMax k-Means clustering algorithm," Pattern Recognition, vol. 47 no. 7, pp. 2505-2516, DOI: 10.1016/j.patcog.2014.01.015, 2014.
[10] J. Qi, Y. Yu, L. Wang, J. Liu, Y. Wang, "An effective and efficient hierarchical K-means clustering algorithm," International Journal of Distributed Sensor Networks, vol. 13 no. 8,DOI: 10.1177/1550147717728627, 2017.
[11] G. Zhang, C. Zhang, H. Zhang, "Improved K-means algorithm based on density canopy," Knowledge-Based Systems, vol. 145, pp. 289-297, DOI: 10.1016/j.knosys.2018.01.031, 2018.
[12] H. Xie, L. Zhang, C. P. Lim, J. Yu, C. Liu, H. Liu, J. Walters, "Improving K-means clustering with enhanced firefly algorithms," Applied Soft Computing, vol. 84,DOI: 10.1016/j.asoc.2019.105763, 2019.
[13] A. Rodriguez, A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344 no. 6191, pp. 1492-1496, DOI: 10.1126/science.1242072, 2014.
[14] B. Wu, H. L. Lu, H. J. Jiang, "Adaptive density peak clustering algorithm," Computer Applications, vol. 40 no. 06, pp. 1654-1661, 2020.
[15] Y. Zhang, S. Chen, G. Yu, "Efficient distributed density peaks for clustering large data sets in MapReduce," IEEE Transactions on Knowledge and Data Engineering, vol. 28 no. 12, pp. 3218-3230, DOI: 10.1109/tkde.2016.2609423, 2016.
[16] K. G. Flores, S. E. Garza, "Density peaks clustering with gap-based automatic center detection," Knowledge-Based Systems, vol. 206,DOI: 10.1016/j.knosys.2020.106350, 2020.
[17] S. L. Wang, D. K. Wang, C. Y. Li, Y. Li, G. Y. Ding, "Clustering by fast search and find of density peaks with data field," Chinese Journal of Electronics, vol. 25 no. 3, pp. 397-402, DOI: 10.1049/cje.2016.05.001, 2016.
[18] H. B. Du, A. Z. Bai, L. J. Zhu, "K-means algorithm based on improved density peak algorithm," Statistics & Decisions, vol. 34 no. 18, pp. 20-24, 2018.
[19] D. H. Yu, G. J. Liu, M. Z. Guo, X. Y. Liu, S. Yao, "Density peaks clustering based on weighted local density sequence and nearest neighbor assignment," IEEE Access, vol. 7, pp. 34301-34317, DOI: 10.1109/access.2019.2904254, 2019.
[20] J. Y. Xie, H. C. Gao, W. X. Xie, X. H. Liu, P. W. Grant, "Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors," Information Sciences, vol. 354, pp. 19-40, DOI: 10.1016/j.ins.2016.03.011, 2016.
[21] S. A. Seyedi, A. Lotfi, P. Moradi, N. N. Qader, "Dynamic graph-based label propagation for density peaks clustering," Expert Systems with Applications, vol. 115, pp. 314-328, DOI: 10.1016/j.eswa.2018.07.075, 2019.
[22] H. Lu, Z. Shen, X. S. Sang, Q. H. Zhao, J. F. Lu, "Community detection method using improved density peak clustering and nonnegative matrix factorization," Neurocomputing, vol. 415, pp. 247-257, DOI: 10.1016/j.neucom.2020.07.080, 2020.
[23] M. Mehmood, G. Z. Zhang, R. F. Bie, D. Dawood, H. Ahmad, "Clustering by fast search and find of density peaks via heat diffusion," Neurocomputing, vol. 208, pp. 210-217, DOI: 10.1016/j.neucom.2016.01.102, 2016.
[24] Y. H. Liu, Z. G. Ma, F. Yu, "Adaptive density peak clustering based on K-nearest neighbors with aggregating strategy," Knowledge-Based Systems, vol. 133, pp. 208-220, 2017.
[25] W. N. Tong, S. Liu, X. Z. Gao, "A density-peak-based clustering algorithm of automatically determining the number of clusters," Neurocomputing, vol. 458, pp. 655-666, DOI: 10.1016/j.neucom.2020.03.125, 2021.
[26] R. W. Zhou, X. J. Chang, L. Shi, Y. D. Shen, Y. Yang, F. P. Nie, "Person reidentification via multi-feature fusion with adaptive graph learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 31 no. 5, pp. 1592-1601, 2019.
[27] Z. H. Li, F. P. Nie, X. J. Chang, Y. Yang, C. Q. Zhang, N. Sebe, "Dynamic affinity graph construction for spectral clustering using multiple features," IEEE Transactions on Neural Networks and Learning Systems, vol. 29 no. 12, pp. 6323-6332, DOI: 10.1109/tnnls.2018.2829867, 2018.
[28] Z. H. Li, F. P. Nie, X. J. Chang, L. Q. Nie, H. X. Zhang, Y. Yang, "Rank-constrained spectral clustering with flexible embedding," IEEE Transactions on Neural Networks and Learning Systems, vol. 29 no. 12, pp. 6073-6082, DOI: 10.1109/tnnls.2018.2817538, 2018.
[29] T. H. Fan, Z. F. Yao, L. Z. Han, B. H. Liu, L. Lv, "Density peaks clustering based on k-nearest neighbors sharing," Concurrency and Computation: Practice and Experience, vol. 33 no. 5,DOI: 10.1002/cpe.5993, 2021.
[30] R. Zhang, T. Du, S. N. Qu, H. W. Sun, "Adaptive density-based clustering algorithm with shared KNN conflict game," Information Sciences, vol. 565, pp. 344-369, DOI: 10.1016/j.ins.2021.02.017, 2021.
[31] Z. Zhou, G. Q. Si, H. D. Sun, K. Qu, W. C. Hou, "A robust clustering algorithm based on the identification of core points and KNN kernel density estimation," Expert Systems with Applications, vol. 195,DOI: 10.1016/j.eswa.2022.116573, 2022.
[32] M. Parmar, D. Wang, X. F. Zhang, A. H. Tan, C. Y. Miao, J. H. Jiang, Y. Zhou, "REDPC: a residual error-based density peak clustering algorithm," Neurocomputing, vol. 348, pp. 82-96, DOI: 10.1016/j.neucom.2018.06.087, 2019.
[33] J. Y. Guan, S. Li, X. X. He, J. J. Chen, "Peak-graph-based fast density peak clustering for image segmentation," IEEE Signal Processing Letters, vol. 28, pp. 897-901, DOI: 10.1109/lsp.2021.3072794, 2021.
[34] M. D. Parmar, W. Pang, D. H. Hao, J. H. Jiang, W. L. Liupu, L. M. Wang, Y. Zhou, "FREDPC: a feasible residual error-based density peak clustering algorithm with the fragment merging strategy," IEEE Access, vol. 7, pp. 89789-89804, DOI: 10.1109/access.2019.2926579, 2019.
[35] C. W. Li, H. M. Chen, T. R. Li, X. L. Yang, "A stable community detection approach for complex network based on density peak clustering and label propagation," Applied Intelligence, vol. 52 no. 2, pp. 1188-1208, DOI: 10.1007/s10489-021-02287-5, 2022.
[36] M. Parmar, D. Wang, A. H. Tan, C. Y. Miao, J. H. Jiang, Y. Zhou, "A novel density peak clustering algorithm based on squared residual error," pp. 43-48.
[37] S. L. Wang, W. Y. Gan, D. Y. Li, D. R. Li, "Data field for hierarchical clustering," International Journal of Data Warehousing and Mining, vol. 7 no. 4, pp. 43-63, DOI: 10.4018/jdwm.2011100103, 2011.
[38] W. Y. Gan, C. Liu, "An improved clustering algorithm for search density peaks," Journal of Intelligent Systems, vol. 12 no. 02, pp. 229-236, 2017.
[39] I. Bárány, V. H. Vu, "Central limit theorems for Gaussian polytopes," Annals of Probability, vol. 35 no. 4, pp. 1593-1621, 2006.
[40] S. L. Wang, D. K. Wang, C. Y. Li, Y. Li, "Comment on "Clustering by fast search and find of density peaks," 2015. https://arxiv.org/abs/1501.04267
[41] Z. Yang, H. J. Wang, "Improved density peak clustering algorithm based on weighted K-nearest neighbors," Application Research of Computers, vol. 37 no. 03, pp. 667-671, 2020.
[42] J. R. Wang, X. Ma, G. L. Duan, "Improved K-means clustering k value selection algorithm," Computer Engineering and Applications, vol. 55 no. 08, pp. 27-33, 2019.
[43] G. J. Gan, M. K. P. Ng, "k -means clustering with outlier removal," Pattern Recognition Letters, vol. 90,DOI: 10.1016/j.patrec.2017.03.008, 2017.
[44] J. Y. Xie, H. C. Gao, W. X. Xie, "K-nearest neighbor optimization based density peak fast search clustering algorithm," Science China Information Sciences, vol. 46 no. 02, pp. 258-280, 2016.
[45] N. X. Vinh, J. Epps, J. Bailey, "Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance," Journal of Machine Learning Research, vol. 11, pp. 2837-2854, 2010.
[46] M. J. Du, S. F. Ding, H. J. Jia, "Study on density peaks clustering based on k-nearest neighbors and principal component analysis," Knowledge-Based Systems, vol. 99, pp. 135-145, DOI: 10.1016/j.knosys.2016.02.001, 2016.
[47] D. Dua, C. Graff, UCI Machine Learning Repository, 2019.
[48] Y. H. Lu, C. Xia, "Optimal k-nearest neighbors and local density-based clustering algorithm for uncertain data," Control and Decision-making, vol. 31 no. 3, pp. 541-546, 2016.
[49] R. Liu, H. Wang, X. M. Yu, "Shared-nearest-neighbor-based clustering by fast search and find of density peaks," Information Sciences, vol. 450, pp. 200-226, DOI: 10.1016/j.ins.2018.03.031, 2018.
[50] R. Y. Jia, Y. G. Li, "K-means algorithm for self-determining the number of clusters and initial center points," Computer Engineering and Applications, vol. 54 no. 07, pp. 152-158, 2018.
[51] S. X. Tian, L. Q. Ding, J. Q. Zheng, "K-means text clustering algorithm based on density peak optimization," Computer Engineering and Design, vol. 38 no. 04, pp. 1019-1023, 2017.
Copyright © 2022 Chao Zhao et al.
Abstract
Aiming at the problems that traditional clustering algorithms select the initial cluster centers randomly and require the number of clusters to be determined manually, which leads to unstable clustering results, we propose an improved clustering algorithm based on the density peak and nearest neighbors. First, an improved density peak clustering method is proposed to optimize the cutoff distance and the local density of the data points; this avoids the tendency of randomly selected initial cluster centers to trap the algorithm in a locally optimal solution. Furthermore, a K-value selection method is presented to choose the optimal number of clusters, determined by the sum of the squared errors within the clusters. Finally, we employ the idea of the K-nearest neighbors to assign the outliers. Experiments on the UCI real data sets indicate that our proposed algorithm achieves better clustering results than several known algorithms.
1 School of Information and Electrical Engineering, Hebei University of Engineering, Handan 056038, China
2 Beijing Wodong Tianjun Information Technology Co. Ltd., Beijing 102676, China
3 School of Management Engineering and Business, Hebei University of Engineering, Handan 056038, China