This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Clustering is a family of unsupervised algorithms that plays an important role in machine learning and data mining. Traditional clustering algorithms can be divided into five categories [1–3]: partitioning, hierarchical, density-based, grid-based, and model-based. K-means is one of the most popular and simplest partition-based clustering algorithms and has been widely used in many fields [4, 5]. However, the K-means algorithm has its own drawbacks, and many scholars have proposed improvements [6], mainly in two directions [7–12]: the selection of the initial cluster centers and the determination of the number of clusters K.
In 2014, Rodriguez and Laio proposed the density peak clustering (DPC) algorithm [13]. It assumes that cluster centers are surrounded by neighbors with lower local density and are relatively far away from any point with a higher local density. The DPC algorithm is of great significance for selecting initial cluster centers and determining the K-value, but it also has limitations [14–16]. One strategy [17] automatically determines the cutoff distance by information entropy, based on the potential energy in the data field instead of the local density between data points. Another approach [18] determines the initial cluster centers with an improved DPC algorithm and uses entropy to weight the Euclidean distance between data points, thereby optimizing the K-means algorithm. The DPCSA algorithm [19] improves DPC with the K-nearest neighbor (KNN) algorithm and a weighted local density sequence. In [20], a fuzzy weighted KNN is used to calculate the local density of data points, and two assignment strategies are employed to classify data points, which improves the robustness of the clustering. The DPC-DLP algorithm [21] uses the KNN idea to compute the global cutoff distance and the local density and assigns data points to clusters with a graph-based label propagation algorithm. The PageRank algorithm can be used to calculate the local density [22], which avoids the instability of the clustering result caused by the cutoff distance. The CFSFDP-HD algorithm [23] uses heat diffusion to calculate the local density and thus reduces the influence of the cutoff distance on the clustering result. The ADPC-KNN algorithm [24] calculates the local density based on KNN and automatically selects the initial cluster centers. The DPADN algorithm [25] redesigns the local density with a continuous function, which allows the cluster centers to be determined automatically. A multi-feature fusion model with adaptive graph learning [26] is developed as an unsupervised algorithm for person reidentification. An adaptive approach [27] simultaneously learns the affinity graph and the feature fusion, resulting in better clustering. The RCFE method [28] guarantees that the number of clusters converges to the ground truth via a rank constraint on the Laplacian matrix. Finally, there is other recent work involving density peaks and nearest neighbors that is worth investigating [29–36].
Inspired by the ideas of DPC and KNN, this paper proposes an improved clustering algorithm. First, the sum of the squared errors (SSE) is used to determine the optimal number of clusters K; then K initial cluster centers are selected based on an improved DPC algorithm. Next, the K-means algorithm is applied iteratively, and the data points in each cluster are divided into core points and outliers according to the average distance within the cluster. Finally, following the nearest-neighbor idea of KNN, the outliers are assigned by voting. Experimental results show that, compared with several known clustering algorithms, the proposed algorithm achieves a better clustering effect in most cases.
The remainder of this paper is organized as follows. Section 2 reviews the DPC algorithm and introduces several improved methods. Section 3 provides our K-value selection method. Section 4 describes the proposed algorithm in detail. Experimental results are presented and discussed in Section 5. The conclusion is stated in Section 6.
2. Related Work
2.1. The DPC Algorithm
The DPC algorithm assumes that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from data points with higher densities. In this approach, for each data point $x_i$, two quantities are computed: its local density $\rho_i$ and its distance $\delta_i$ from points with higher local density.
There are two ways to calculate the local density $\rho_i$: the cutoff kernel, $\rho_i=\sum_{j\neq i}\chi(d_{ij}-d_c)$ with $\chi(x)=1$ if $x<0$ and $\chi(x)=0$ otherwise, and the Gaussian kernel, $\rho_i=\sum_{j\neq i}\exp\left(-d_{ij}^2/d_c^2\right)$, where $d_{ij}$ is the distance between points $x_i$ and $x_j$ and $d_c$ is the cutoff distance.
Comparing the two calculation methods, the local density obtained with the cutoff kernel is a discrete (integer) value, whereas the Gaussian kernel yields a continuous value. The cutoff kernel may therefore produce identical local densities for different data points, causing conflicts; for this reason, the Gaussian kernel is more commonly used.
The distance $\delta_i$ is defined as the minimum distance from point $x_i$ to any point with a higher local density, $\delta_i=\min_{j:\rho_j>\rho_i}d_{ij}$. For the point with the highest density, it conventionally takes $\delta_i=\max_j d_{ij}$.
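For concreteness, the following sketch (not the authors' code; variable and function names are illustrative) computes the two DPC quantities just defined, using the Gaussian kernel for the local density:

```python
# Minimal sketch of the standard DPC quantities: the Gaussian-kernel local
# density rho_i and the distance delta_i to the nearest denser point.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dpc_quantities(X, dc):
    """Return (rho, delta) for data matrix X (n x m) and cutoff distance dc."""
    D = squareform(pdist(X))                      # pairwise distances d_ij
    rho = np.exp(-(D / dc) ** 2).sum(axis=1) - 1  # Gaussian kernel; drop self-term
    delta = np.zeros(len(X))
    order = np.argsort(-rho)                      # indices sorted by decreasing density
    delta[order[0]] = D[order[0]].max()           # densest point: max distance to others
    for k in range(1, len(order)):
        i = order[k]
        delta[i] = D[i, order[:k]].min()          # min distance to any denser point
    return rho, delta
```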
Assume that the original data set is shown in Figure 1. The local density $\rho_i$ and the distance $\delta_i$ of each data point are calculated according to the definitions above.
[figure(s) omitted; refer to PDF]
In order to determine the cluster centers and their number clearly, the quantity $\gamma_i=\rho_i\delta_i$ is defined; the larger $\gamma_i$ is, the more likely point $x_i$ is to be a cluster center. According to the descending order of $\gamma_i$, the data points with distinctly large values are selected as the cluster centers, as illustrated in the following figure.
[figure(s) omitted; refer to PDF]
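Continuing the sketch above, a small illustrative helper can rank points by $\gamma_i=\rho_i\delta_i$ and return the indices of the K candidate centers; in the proposed algorithm K is not read from the decision graph but supplied by the K-value selection method of Section 3.

```python
# Illustrative helper: rank points by gamma = rho * delta and return the K
# indices most likely to be cluster centers.
import numpy as np

def select_centers(rho, delta, K):
    gamma = rho * delta
    return np.argsort(-gamma)[:K]   # indices of the K largest gamma values
```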
2.2. Set the Cutoff Distance
It is not advisable to set the cutoff distance $d_c$ manually, because an inappropriate $d_c$ directly changes the local densities and therefore the clustering result; different data sets generally require different cutoff distances.
Assuming $p$ is a percentage, we define the cutoff distance as $d_c=\mathrm{dis}\left(\lceil p\cdot N\rceil\right)$, where $N$ is the number of distances between any two data points and $\mathrm{dis}$ is the list of these distances sorted in ascending order; that is, the value at position $\lceil p\cdot N\rceil$ of the sorted list is taken as $d_c$.
[figure(s) omitted; refer to PDF]
We can see that the cutoff distance $d_c$ obtained in this way depends strongly on the chosen percentage $p$, and an unsuitable $p$ leads to an unsuitable $d_c$ and hence a poor clustering result. For this reason, information entropy is introduced to determine $d_c$ automatically.
In information theory, Shannon entropy is used to measure the uncertainty of a system: the greater the entropy, the greater the uncertainty [37]. Similarly, the uncertainty of the data distribution can be expressed by entropy. Suppose that the local density of data point $x_i$ is $\rho_i$ $(i=1,2,\ldots,n)$; the information entropy of the data set is defined as $H=-\sum_{i=1}^{n}(\rho_i/Z)\log(\rho_i/Z)$, where $Z=\sum_{i=1}^{n}\rho_i$ is the normalization factor. The relationship between the information entropy $H$ and the impact factor (here, the cutoff distance $d_c$) is illustrated in the following figure; the $d_c$ that minimizes $H$ is taken as the optimal cutoff distance.
[figure(s) omitted; refer to PDF]
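As a hedged sketch of this entropy-guided search (the candidate grid and the exact criterion used in the paper may differ), one can scan candidate cutoff distances taken from percentiles of the pairwise distances and keep the one whose normalized Gaussian-kernel densities have the smallest entropy:

```python
# Sketch: pick the cutoff distance d_c that minimizes the entropy of the
# normalized local densities. Function names and the percentile grid are
# illustrative assumptions, not the authors' implementation.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def density_entropy(rho, eps=1e-12):
    p = (rho + eps) / (rho + eps).sum()        # normalize by Z = sum(rho)
    return float(-(p * np.log(p)).sum())       # H = -sum_i p_i log p_i

def choose_dc_by_entropy(X, percentiles=np.arange(0.5, 10.5, 0.5)):
    dists = pdist(X)                           # all pairwise distances
    D = squareform(dists)
    best_dc, best_H = None, np.inf
    for q in percentiles:
        dc = np.percentile(dists, q)           # candidate cutoff distance
        if dc <= 0:
            continue
        rho = np.exp(-(D / dc) ** 2).sum(axis=1) - 1.0   # Gaussian-kernel density
        H = density_entropy(rho)
        if H < best_H:
            best_dc, best_H = dc, H
    return best_dc
```

Both a very small and a very large $d_c$ make all densities nearly equal (maximum entropy), so the entropy typically dips at an intermediate value, which is the one returned.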
2.3. Optimize the Local Density
The DPC algorithm only considers the global structure of the data set, and its clustering result is not good enough on unevenly distributed data sets [41]. When the data are relatively concentrated and the densities of the clusters differ greatly, changes in local density make it difficult to select the correct cluster centers, which affects the final clustering result. Based on the idea of the KNN algorithm, the nearest neighbors are therefore introduced into the local density calculation: the closer a data point is to the target point, the more it contributes to the target point's local density. The new local density of a point is accordingly computed only from its K nearest neighbors, with each neighbor's contribution decreasing as its distance increases, as sketched below.
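A minimal sketch of such a K-nearest-neighbor local density follows; the exponential weighting of neighbor distances is an assumption for illustration and is not necessarily the exact kernel used in the paper.

```python
# Sketch of a KNN-based local density: only the k nearest neighbors contribute,
# and closer neighbors contribute more (here via exp(-distance)).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_local_density(X, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own neighbor
    dist, _ = nn.kneighbors(X)                        # dist[:, 0] == 0 (the point itself)
    return np.exp(-dist[:, 1:]).sum(axis=1)           # sum of exp(-d_ij) over the k neighbors
```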
3. The K-Value Selection Method
In the DPC algorithm, the boundary between cluster center points and non-center points may be unclear, which causes different users to choose different numbers of clusters according to their own experience. Similarly, in the K-means algorithm, the number of clusters also needs to be preset by the user based on experience. However, it is often difficult to determine the value of K.
To solve the problem that the optimal number of clusters is difficult to choose, we propose a K-value selection method based on the SSE, following the ET-SSE algorithm in [42].
Ordinarily, the K-means algorithm employs the SSE to measure clustering quality, $\mathrm{SSE}=\sum_{k=1}^{K}\sum_{x\in C_k}\lVert x-c_k\rVert^2$, where $C_k$ is the $k$-th cluster and $c_k$ is its center. As K increases, the SSE decreases monotonically, and the "elbow point" of the SSE curve, where the decrease suddenly slows down, indicates a suitable number of clusters.
[figure(s) omitted; refer to PDF]
To address the problem that the "elbow point" in Figure 6(b) is not clear, the exponential function is applied to transform the SSE so that the difference between adjacent K values is amplified and the elbow becomes more distinct; the K at this sharpened elbow is chosen as the optimal number of clusters.
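The sketch below illustrates the K-value selection idea under stated assumptions: the SSE is taken from scikit-learn's KMeans inertia, and the elbow is located as the K with the largest change of slope. The exact exponential transform of the ET-SSE method [42] is not reproduced here.

```python
# Sketch: compute the SSE curve for K = 1..k_max and pick the "elbow"
# (the K where the decrease in SSE slows down the most).
import numpy as np
from sklearn.cluster import KMeans

def sse_curve(X, k_max=10, seed=0):
    sse = []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        sse.append(km.inertia_)          # inertia_ is exactly the within-cluster SSE
    return np.array(sse)

def elbow_k(sse):
    drops = np.diff(sse)                 # drops[i] = SSE(K=i+2) - SSE(K=i+1), negative
    second = np.diff(drops)              # change of slope between consecutive K
    return int(np.argmax(second)) + 2    # +2 maps the index back to a K value
```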
4. The Improved Clustering Algorithm
Aiming at the unstable clustering results caused by randomly selecting the initial cluster centers in the traditional K-means algorithm, this paper proposes an improved clustering algorithm based on the density peak and nearest neighbors. First, information entropy is used to improve the DPC algorithm: the optimal cutoff distance of the data set is found, and the local densities of the data points are calculated. Next, the optimal number of clusters is obtained by the K-value selection method proposed in Section 3. Finally, the data points are sorted in descending order of the values in (5), the top K points are selected as the initial cluster centers, and the K-means algorithm is used for iterative clustering.
4.1. Weighted Euclidean Distance
Let the data set be $X=\{x_1,x_2,\ldots,x_n\}$, where each data point $x_i=(x_{i1},x_{i2},\ldots,x_{im})$ has $m$ attributes.
In order to remove the unit restrictions of the different attributes in the original data and to avoid their impact on the clustering results, the original data need to be normalized and converted into pure numerical values. After normalization, each attribute is on the same order of magnitude, which makes a comprehensive comparative evaluation possible. A standard min-max normalization is used: $x'_{il}=\left(x_{il}-\min_{1\le j\le n}x_{jl}\right)/\left(\max_{1\le j\le n}x_{jl}-\min_{1\le j\le n}x_{jl}\right)$.
The traditional K-means algorithm employs the Euclidean distance to measure the similarity between data points. It treats all attributes uniformly and regards the differences between different attributes as equally important. In practice, however, different attributes contribute quite differently to the clustering result. To solve this problem, a weighted Euclidean distance is used to measure the similarity between data points. Let the weight of the $l$-th attribute be $w_l$; the weighted Euclidean distance between data points $x_i$ and $x_j$ is then $d_w(x_i,x_j)=\sqrt{\sum_{l=1}^{m}w_l\,(x_{il}-x_{jl})^2}$.
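A brief sketch of the normalization and the weighted distance is given below. The min-max scaling and the uniform placeholder weights are assumptions for illustration; the paper's own attribute-weight formula is not reproduced here.

```python
# Sketch: min-max normalization and a weighted Euclidean distance.
import numpy as np

def min_max_normalize(X):
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)   # guard constant attributes

def weighted_euclidean(a, b, w):
    return np.sqrt(np.sum(w * (a - b) ** 2))            # sqrt(sum_l w_l (a_l - b_l)^2)

# usage with uniform placeholder weights:
# Xn = min_max_normalize(X)
# d = weighted_euclidean(Xn[0], Xn[1], np.full(X.shape[1], 1.0 / X.shape[1]))
```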
4.2. The Framework of the Proposed Algorithm
The K-means algorithm is sensitive to outliers [43], and the improved DPC algorithm can exclude the influence of outliers on the selection of initial cluster centers. However, K-means is an iterative clustering algorithm, and each iteration will generate new cluster centers. The outliers will have an impact on the generation of new cluster centers. Hence, it is necessary to distinguish the outliers in each cluster.
Let $\mathrm{MeanDist}_k$ denote the average distance from the data points in cluster $C_k$ to its cluster center $c_k$, that is, $\mathrm{MeanDist}_k=\frac{1}{|C_k|}\sum_{x\in C_k}d(x,c_k)$. A data point whose distance to its cluster center does not exceed $\mathrm{MeanDist}_k$ is regarded as a core point of the cluster; otherwise, it is regarded as an outlier.
To ensure that the assignment of outliers is closer to the real situation and to improve the clustering accuracy, the idea of KNN is introduced into the assignment of outliers [44–46]. Suppose that $x$ is an outlier; its K nearest neighbors among the core points are found, and $x$ is assigned by voting to the cluster that contains the majority of these neighbors.
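The outlier handling just described can be sketched as follows (an illustrative implementation, not the authors' code); it assumes integer cluster labels $0,\ldots,K-1$ and non-empty clusters.

```python
# Sketch: split each cluster into core points and outliers by the MeanDist rule,
# recompute centers from the core points only, and re-assign each outlier by a
# majority vote among its k nearest core points.
import numpy as np
from scipy.spatial.distance import cdist

def split_and_reassign(X, labels, centers, k=5):
    n_clusters = centers.shape[0]
    d_to_center = np.linalg.norm(X - centers[labels], axis=1)
    core = np.zeros(len(X), dtype=bool)
    for c in range(n_clusters):
        in_c = labels == c
        core[in_c] = d_to_center[in_c] <= d_to_center[in_c].mean()   # MeanDist rule
    # new centers from core points only (assumes every cluster is non-empty)
    new_centers = np.vstack([X[core & (labels == c)].mean(axis=0)
                             for c in range(n_clusters)])
    # re-assign each outlier by voting among its k nearest core points
    new_labels = labels.copy()
    outliers = np.where(~core)[0]
    if len(outliers):
        D = cdist(X[outliers], X[core])
        core_labels = labels[core]
        for row, i in enumerate(outliers):
            nearest = np.argsort(D[row])[:k]
            votes = np.bincount(core_labels[nearest], minlength=n_clusters)
            new_labels[i] = votes.argmax()
    return new_labels, new_centers, core
```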
In summary, the steps of our proposed algorithm are as follows (a compact sketch of the whole procedure is given after the step list):
(1) Input the original data set $X=\{x_1,x_2,\ldots,x_n\}$.
(2) Normalize the data set X according to (15) to obtain the processed data set $X'$.
(3) According to (17), the weighted Euclidean distance between each pair of data points in $X'$ is calculated to obtain the distance matrix.
(4) Calculate the optimal cutoff distance $d_c$ by minimizing the information entropy, and then calculate the local density $\rho_i$ of each data point.
(5) Calculate the distance $\delta_i$ of each data point and the quantity $\gamma_i=\rho_i\delta_i$, and sort the data points in descending order of $\gamma_i$.
(6) Determine the number of clusters K according to the K-value selection method presented in Section 3.
(7) The first K data points of the sequence sorted in descending order of $\gamma$ are selected as the initial cluster centers.
(8) The distances from each data point to each cluster center are calculated, and a data point is classified into the cluster with the nearest cluster center.
(9) Calculate the mean distance MeanDist of each cluster according to (18). The data points in each cluster are divided into core points and outliers.
(10) Calculate the average value of the core points in each cluster as the new cluster centers of the next iteration. The idea of nearest neighbors in KNN algorithm is used to re-assign the outliers.
(11) When the cluster centers no longer change, the algorithm terminates and outputs the clustering result. Otherwise, go to Step 8.
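The sketch below ties the illustrative helpers from the previous subsections together into one driver following steps 1-11; it reuses min_max_normalize, dpc_quantities, select_centers, and split_and_reassign defined earlier and is an assumption-laden illustration, not the authors' implementation.

```python
# Compact driver for the overall procedure (steps 1-11), reusing the earlier
# illustrative helpers; dc comes from the entropy search of Section 2.2 and K
# from the K-value selection method of Section 3.
import numpy as np
from scipy.spatial.distance import cdist

def improved_clustering(X, K, dc, knn_k=5, max_iter=100):
    Xn = min_max_normalize(X)                       # step 2
    rho, delta = dpc_quantities(Xn, dc)             # steps 4-5
    centers = Xn[select_centers(rho, delta, K)].copy()   # step 7: top-K gamma values
    for _ in range(max_iter):
        labels = cdist(Xn, centers).argmin(axis=1)  # step 8: nearest-center assignment
        labels, new_centers, _ = split_and_reassign(Xn, labels, centers, k=knn_k)  # steps 9-10
        if np.allclose(new_centers, centers):       # step 11: stop when centers are stable
            break
        centers = new_centers
    return labels, centers
```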
Furthermore, a flowchart for the overall process is provided in Figure 7.
[figure(s) omitted; refer to PDF]
4.3. The Time Complexity Analysis
The time complexity of the improved DPC algorithm mainly depends on the calculation of the local density $\rho_i$ and the distance $\delta_i$ of each data point, both of which require the distances between all pairs of data points; its time complexity is therefore $O(n^2)$, where $n$ is the number of data points, and this term dominates the overall complexity of the proposed algorithm.
5. Experiments
5.1. Experimental Environment and the Data Sets
The experiments are run on Windows 10 Professional (64-bit) with an Intel Core i3-4000M CPU at 2.40 GHz and 4 GB of memory. The proposed algorithm is implemented in MATLAB R2011a. The Wine, Pima, WDBC, Iris, and Parkinsons data sets from the UCI repository of real data sets [47] are used as the experimental data sets, as shown in Table 1.
Table 1
The description of experimental data sets.
Data set | Samples | Attributes | Clusters |
Wine | 178 | 13 | 3 |
Pima | 768 | 8 | 2 |
WDBC | 569 | 30 | 2 |
Iris | 150 | 4 | 3 |
Parkinsons | 195 | 22 | 2 |
5.2. The Evaluation Measures
This paper uses accuracy (ACC), the adjusted Rand index (ARI), and adjusted mutual information (AMI) to evaluate the performance of the clustering algorithms. Assume that $U$ is the ground-truth partition of a data set and $V$ is the partition produced by a clustering algorithm; ACC is the proportion of data points assigned to the correct cluster under the best one-to-one matching between $U$ and $V$, while ARI and AMI measure the chance-corrected agreement between $U$ and $V$ [45]. For all three measures, larger values indicate better clustering results.
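As a reference for how these measures can be computed, the sketch below uses scikit-learn for ARI and AMI and the Hungarian algorithm for ACC (clustering accuracy requires the best one-to-one matching between predicted and true labels); integer-coded labels are assumed, and the function names are illustrative.

```python
# Sketch: ACC via Hungarian matching of the contingency matrix, ARI and AMI
# from scikit-learn.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

def clustering_acc(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                          # contingency matrix
    row, col = linear_sum_assignment(-cost)      # maximize correctly matched pairs
    return cost[row, col].sum() / len(y_true)

def evaluate(y_true, y_pred):
    return {"ACC": clustering_acc(y_true, y_pred),
            "ARI": adjusted_rand_score(y_true, y_pred),
            "AMI": adjusted_mutual_info_score(y_true, y_pred)}
```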
5.3. Experimental Results and Discussion
In this paper, we compare our proposed algorithm with several existing algorithms: K-means, DPC [13], CNACS-K-means [50], and DCC-K-means [51]. Because of the influence of the randomly selected initial cluster centers, the K-means algorithm is run 20 times and the mean values are reported. The experimental results on the UCI data sets are shown in Tables 2–4, where the values in bold are the best results on each data set.
Table 2
The comparison of ACC for the five algorithms.
Data set | Our algorithm | DPC | K-means | CNACS-K-means | DCC-K-means |
Wine | 0.9326 | 0.5337 | 0.8972 | 0.9157 | 0.9270 |
Pima | 0.6654 | 0.6458 | 0.6678 | 0.4128 | 0.6680 |
WDBC | 0.9279 | 0.7733 | 0.9332 | 0.7381 | 0.9332 |
Iris | 0.8933 | 0.6667 | 0.8287 | 0.6667 | 0.6667 |
Parkinsons | 0.7026 | 0.7436 | 0.6972 | 0.7333 | 0.6769 |
Table 3
The comparison of ARI for the five algorithms.
Data set | Our algorithm | DPC | K-means | CNACS-K-means | DCC-K-means |
Wine | 0.8025 | 0.2534 | 0.7605 | 0.757 | 0.7882 |
Pima | 0.1083 | 0.1072 | 0.1022 | 0.1199 | 0.1024 |
WDBC | 0.7313 | 0.2745 | 0.7483 | 0.6029 | 0.7483 |
Iris | 0.7312 | 0.5681 | 0.7217 | 0.5681 | 0.5399 |
Parkinsons | 0.1597 | 0.1599 | 0.1472 | 0.1642 | 0.1187 |
Table 4
The comparison of AMI for the five algorithms.
Data set | Our algorithm | DPC | K-means | CNACS-K-means | DCC-K-means |
Wine | 0.7898 | 0.3272 | 0.7480 | 0.7335 | 0.7785 |
Pima | 0.8912 | 0.4537 | 0.5012 | 0.6846 | 0.5026 |
WDBC | 0.6123 | 0.2505 | 0.6387 | 0.3989 | 0.6387 |
Iris | 0.7578 | 0.5768 | 0.3648 | 0.5768 | 0.5194 |
Parkinsons | 0.1358 | 0.0547 | 0.2010 | 0.0648 | 0.2127 |
As shown in Tables 2–4, the algorithm we proposed is generally better than the other four comparison algorithms in terms of the three evaluation measures. It achieves the best experimental results especially on the Wine and Iris data sets.
In the comparison of ACC in Table 2, our algorithm is slightly worse than the K-means and DCC-K-means algorithms on the Pima and WDBC data sets but better than the other two comparison algorithms. On the Parkinsons data set the situation is reversed: it outperforms K-means and DCC-K-means but falls behind DPC and CNACS-K-means.
As for the ARI comparison in Table 3, the proposed algorithm is lower than CNACS-K-means on the Pima data set but higher than the other three comparison algorithms. On the WDBC data set, it is only 0.017 lower than the K-means and DCC-K-means algorithms. On the Parkinsons data set, it is lower than CNACS-K-means and only 0.0002 below DPC, but higher than the other two comparison algorithms.
In Table 4, the comparison of AMI shows that our algorithm is superior to the other four comparison algorithms on the Pima data set. On the WDBC data set, it is close to the K-means and DCC-K-means algorithms, and superior to the other two comparison algorithms. Although our algorithm outperforms the DPC and CNACS-K-means algorithms, it is indeed inferior to the other two comparison algorithms on the Parkinsons data set.
6. Conclusion
Focusing on the problems of randomly selected initial cluster centers, the manual determination of the number of clusters, and the neglect of outliers during the clustering process, this paper proposes an improved clustering algorithm based on the density peak and the nearest neighbors. Our algorithm uses an improved DPC algorithm to determine the initial cluster centers and uses the within-cluster sum of squared errors to find the optimal cluster number K. Moreover, the average distance within each cluster and the nearest-neighbor idea are combined to identify the outliers and determine their assignment. The experimental results show that the proposed algorithm achieves better clustering results on the UCI real data sets. However, it is not yet clear whether the algorithm scales to large data sets; improving its stability and running efficiency on large-scale data will be the focus of future work.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61902262) and Handan City Science and Technology Research and Development Program (19422031008-15).
[1] A. Saxena, M. Prasad, A. Gupta, N. Bharill, O. P. Patel, A. Tiwari, M. J. Er, W. P. Ding, C. T. Lin, "A review of clustering techniques and developments," Neurocomputing, vol. 267, pp. 664-681, DOI: 10.1016/j.neucom.2017.06.053, 2017.
[2] G. M. Mazzeo, E. Masciari, C. Zaniolo, "A fast and accurate algorithm for unsupervised clustering around centroids," Information Sciences, vol. 400-401, pp. 63-90, DOI: 10.1016/j.ins.2017.03.002, 2017.
[3] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, A. Khalil, A. Y. Zomaya, S. Foufou, A. Bouras, "A survey of clustering algorithms for big data: taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2 no. 3, pp. 267-279, DOI: 10.1109/tetc.2014.2330519, 2014.
[4] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31 no. 8, pp. 651-666, DOI: 10.1016/j.patrec.2009.09.011, 2010.
[5] J. MacQueen, "Some methods for classification and analysis of multivariate observations," Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297, 1967.
[6] J. C. Yang, C. Zhao, "Survey on K-Means clustering algorithm," Computer Engineering and Applications, vol. 55 no. 23, 2019.
[7] H. Ismkhan, "I-k-means−+: an iterative clustering algorithm based on an enhanced version of the k-means," Pattern Recognition, vol. 79, pp. 402-413, DOI: 10.1016/j.patcog.2018.02.015, 2018.
[8] J. S. Lei, T. Jiang, K. Wu, H. Z. Du, G. K. Zhu, Z. Q. Wang, "Robust K-means algorithm with automatically splitting and merging clusters and its applications for surveillance data," Multimedia Tools and Applications, vol. 75 no. 19, pp. 12043-12059, DOI: 10.1007/s11042-016-3322-5, 2016.
[9] G. Tzortzis, A. Likas, "The MinMax k-Means clustering algorithm," Pattern Recognition, vol. 47 no. 7, pp. 2505-2516, DOI: 10.1016/j.patcog.2014.01.015, 2014.
[10] J. Qi, Y. Yu, L. Wang, J. Liu, Y. Wang, "An effective and efficient hierarchical K-means clustering algorithm," International Journal of Distributed Sensor Networks, vol. 13 no. 8,DOI: 10.1177/1550147717728627, 2017.
[11] G. Zhang, C. Zhang, H. Zhang, "Improved K-means algorithm based on density canopy," Knowledge-Based Systems, vol. 145, pp. 289-297, DOI: 10.1016/j.knosys.2018.01.031, 2018.
[12] H. Xie, L. Zhang, C. P. Lim, J. Yu, C. Liu, H. Liu, J. Walters, "Improving K-means clustering with enhanced firefly algorithms," Applied Soft Computing, vol. 84,DOI: 10.1016/j.asoc.2019.105763, 2019.
[13] A. Rodriguez, A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344 no. 6191, pp. 1492-1496, DOI: 10.1126/science.1242072, 2014.
[14] B. Wu, H. L. Lu, H. J. Jiang, "Adaptive density peak clustering algorithm," Computer Applications, vol. 40 no. 06, pp. 1654-1661, 2020.
[15] Y. Zhang, S. Chen, G. Yu, "Efficient distributed density peaks for clustering large data sets in MapReduce," IEEE Transactions on Knowledge and Data Engineering, vol. 28 no. 12, pp. 3218-3230, DOI: 10.1109/tkde.2016.2609423, 2016.
[16] K. G. Flores, S. E. Garza, "Density peaks clustering with gap-based automatic center detection," Knowledge-Based Systems, vol. 206,DOI: 10.1016/j.knosys.2020.106350, 2020.
[17] S. L. Wang, D. K. Wang, C. Y. Li, Y. Li, G. Y. Ding, "Clustering by fast search and find of density peaks with data field," Chinese Journal of Electronics, vol. 25 no. 3, pp. 397-402, DOI: 10.1049/cje.2016.05.001, 2016.
[18] H. B. Du, A. Z. Bai, L. J. Zhu, "K-means algorithm based on improved density peak algorithm," Statistics & Decisions, vol. 34 no. 18, pp. 20-24, 2018.
[19] D. H. Yu, G. J. Liu, M. Z. Guo, X. Y. Liu, S. Yao, "Density peaks clustering based on weighted local density sequence and nearest neighbor assignment," IEEE Access, vol. 7, pp. 34301-34317, DOI: 10.1109/access.2019.2904254, 2019.
[20] J. Y. Xie, H. C. Gao, W. X. Xie, X. H. Liu, P. W. Grant, "Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors," Information Sciences, vol. 354, pp. 19-40, DOI: 10.1016/j.ins.2016.03.011, 2016.
[21] S. A. Seyedi, A. Lotfi, P. Moradi, N. N. Qader, "Dynamic graph-based label propagation for density peaks clustering," Expert Systems with Applications, vol. 115, pp. 314-328, DOI: 10.1016/j.eswa.2018.07.075, 2019.
[22] H. Lu, Z. Shen, X. S. Sang, Q. H. Zhao, J. F. Lu, "Community detection method using improved density peak clustering and nonnegative matrix factorization," Neurocomputing, vol. 415, pp. 247-257, DOI: 10.1016/j.neucom.2020.07.080, 2020.
[23] M. Mehmood, G. Z. Zhang, R. F. Bie, D. Dawood, H. Ahmad, "Clustering by fast search and find of density peaks via heat diffusion," Neurocomputing, vol. 208, pp. 210-217, DOI: 10.1016/j.neucom.2016.01.102, 2016.
[24] Y. H. Liu, Z. G. Ma, F. Yu, "Adaptive density peak clustering based on K-nearest neighbors with aggregating strategy," Knowledge-Based Systems, vol. 133, pp. 208-220, 2017.
[25] W. N. Tong, S. Liu, X. Z. Gao, "A density-peak-based clustering algorithm of automatically determining the number of clusters," Neurocomputing, vol. 458, pp. 655-666, DOI: 10.1016/j.neucom.2020.03.125, 2021.
[26] R. W. Zhou, X. J. Chang, L. Shi, Y. D. Shen, Y. Yang, F. P. Nie, "Person reidentification via multi-feature fusion with adaptive graph learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 31 no. 5, pp. 1592-1601, 2019.
[27] Z. H. Li, F. P. Nie, X. J. Chang, Y. Yang, C. Q. Zhang, N. Sebe, "Dynamic affinity graph construction for spectral clustering using multiple features," IEEE Transactions on Neural Networks and Learning Systems, vol. 29 no. 12, pp. 6323-6332, DOI: 10.1109/tnnls.2018.2829867, 2018.
[28] Z. H. Li, F. P. Nie, X. J. Chang, L. Q. Nie, H. X. Zhang, Y. Yang, "Rank-constrained spectral clustering with flexible embedding," IEEE Transactions on Neural Networks and Learning Systems, vol. 29 no. 12, pp. 6073-6082, DOI: 10.1109/tnnls.2018.2817538, 2018.
[29] T. H. Fan, Z. F. Yao, L. Z. Han, B. H. Liu, L. Lv, "Density peaks clustering based on k-nearest neighbors sharing," Concurrency and Computation: Practice and Experience, vol. 33 no. 5,DOI: 10.1002/cpe.5993, 2021.
[30] R. Zhang, T. Du, S. N. Qu, H. W. Sun, "Adaptive density-based clustering algorithm with shared KNN conflict game," Information Sciences, vol. 565, pp. 344-369, DOI: 10.1016/j.ins.2021.02.017, 2021.
[31] Z. Zhou, G. Q. Si, H. D. Sun, K. Qu, W. C. Hou, "A robust clustering algorithm based on the identification of core points and KNN kernel density estimation," Expert Systems with Applications, vol. 195,DOI: 10.1016/j.eswa.2022.116573, 2022.
[32] M. Parmar, D. Wang, X. F. Zhang, A. H. Tan, C. Y. Miao, J. H. Jiang, Y. Zhou, "REDPC: a residual error-based density peak clustering algorithm," Neurocomputing, vol. 348, pp. 82-96, DOI: 10.1016/j.neucom.2018.06.087, 2019.
[33] J. Y. Guan, S. Li, X. X. He, J. J. Chen, "Peak-graph-based fast density peak clustering for image segmentation," IEEE Signal Processing Letters, vol. 28, pp. 897-901, DOI: 10.1109/lsp.2021.3072794, 2021.
[34] M. D. Parmar, W. Pang, D. H. Hao, J. H. Jiang, W. L. Liupu, L. M. Wang, Y. Zhou, "FREDPC: a feasible residual error-based density peak clustering algorithm with the fragment merging strategy," IEEE Access, vol. 7, pp. 89789-89804, DOI: 10.1109/access.2019.2926579, 2019.
[35] C. W. Li, H. M. Chen, T. R. Li, X. L. Yang, "A stable community detection approach for complex network based on density peak clustering and label propagation," Applied Intelligence, vol. 52 no. 2, pp. 1188-1208, DOI: 10.1007/s10489-021-02287-5, 2022.
[36] M. Parmar, D. Wang, A. H. Tan, C. Y. Miao, J. H. Jiang, Y. Zhou, "A novel density peak clustering algorithm based on squared residual error," pp. 43-48.
[37] S. L. Wang, W. Y. Gan, D. Y. Li, D. R. Li, "Data field for hierarchical clustering," International Journal of Data Warehousing and Mining, vol. 7 no. 4, pp. 43-63, DOI: 10.4018/jdwm.2011100103, 2011.
[38] W. Y. Gan, C. Liu, "An improved clustering algorithm for search density peaks," Journal of Intelligent Systems, vol. 12 no. 02, pp. 229-236, 2017.
[39] I. Bárány, V. H. Vu, "Central limit theorems for Gaussian polytopes," Annals of Probability, vol. 35 no. 4, pp. 1593-1621, 2006.
[40] S. L. Wang, D. K. Wang, C. Y. Li, Y. Li, "Comment on "Clustering by fast search and find of density peaks," 2015. https://arxiv.org/abs/1501.04267
[41] Z. Yang, H. J. Wang, "Improved density peak clustering algorithm based on weighted K-nearest neighbors," Application Research of Computers, vol. 37 no. 03, pp. 667-671, 2020.
[42] J. R. Wang, X. Ma, G. L. Duan, "Improved K-means clustering k value selection algorithm," Computer Engineering and Applications, vol. 55 no. 08, pp. 27-33, 2019.
[43] G. J. Gan, M. K. P. Ng, "k -means clustering with outlier removal," Pattern Recognition Letters, vol. 90,DOI: 10.1016/j.patrec.2017.03.008, 2017.
[44] J. Y. Xie, H. C. Gao, W. X. Xie, "K-nearest neighbor optimization based density peak fast search clustering algorithm," Science China Information Sciences, vol. 46 no. 02, pp. 258-280, 2016.
[45] N. X. Vinh, J. Epps, J. Bailey, "Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance," Journal of Machine Learning Research, vol. 11, pp. 2837-2854, 2010.
[46] M. J. Du, S. F. Ding, H. J. Jia, "Study on density peaks clustering based on k-nearest neighbors and principal component analysis," Knowledge-Based Systems, vol. 99, pp. 135-145, DOI: 10.1016/j.knosys.2016.02.001, 2016.
[47] D. Dua, C. Graff, UCI Machine Learning Repository, 2019.
[48] Y. H. Lu, C. Xia, "Optimal k-nearest neighbors and local density-based clustering algorithm for uncertain data," Control and Decision-making, vol. 31 no. 3, pp. 541-546, 2016.
[49] R. Liu, H. Wang, X. M. Yu, "Shared-nearest-neighbor-based clustering by fast search and find of density peaks," Information Sciences, vol. 450, pp. 200-226, DOI: 10.1016/j.ins.2018.03.031, 2018.
[50] R. Y. Jia, Y. G. Li, "K-means algorithm for self-determining the number of clusters and initial center points," Computer Engineering and Applications, vol. 54 no. 07, pp. 152-158, 2018.
[51] S. X. Tian, L. Q. Ding, J. Q. Zheng, "K-means text clustering algorithm based on density peak optimization," Computer Engineering and Design, vol. 38 no. 04, pp. 1019-1023, 2017.
Copyright © 2022 Chao Zhao et al.
Abstract
Aiming at the problems that traditional clustering algorithms select the initial cluster centers randomly and require the number of clusters to be determined manually, which leads to unstable clustering results, we propose an improved clustering algorithm based on the density peak and nearest neighbors. First, an improved density peak clustering method is proposed to optimize the cutoff distance and the local density of the data points; this avoids the tendency of randomly selected initial cluster centers to trap the algorithm in a locally optimal solution. Furthermore, a K-value selection method is presented to choose the optimal number of clusters, determined by the sum of the squared errors within the clusters. Finally, we employ the idea of the K-nearest neighbors to assign the outliers. Experiments on the UCI real data sets indicate that our proposed algorithm achieves better clustering results than several known algorithms.
1 School of Information and Electrical Engineering, Hebei University of Engineering, Handan 056038, China
2 Beijing Wodong Tianjun Information Technology Co. Ltd., Beijing 102676, China
3 School of Management Engineering and Business, Hebei University of Engineering, Handan 056038, China