基于网格筛选的大规模密度峰值聚类算法

徐晓; 丁世飞; 孙统风; 廖红梅

doi:10.7544/issn1000-1239.2018.20170227

基于网格筛选的大规模密度峰值聚类算法

Large-Scale Density Peaks Clustering Algorithm Based on Grid Screening

摘要

摘要: 密度峰值聚类算法(density peaks clustering algorithm, DPC)是2014年提出的一种新型聚类分析算法,它基于聚类中心局部密度大以及与密度更大点之间的距离较远两大特点绘制决策图寻找聚类中心，从而得到任意形状的簇.但在寻找聚类中心的过程中，求解局部密度以及高密度距离属性都依赖于相似度矩阵的计算，计算复杂度较高，限制了密度峰值聚类算法在大规模数据集中的应用.针对此不足，提出基于网格筛选的密度峰值聚类算法(density peaks clustering algorithm based on grid screening, SDPC)，根据数据的不均匀分布，使用网格化方法去除部分密度稀疏的点，然后再使用密度峰值聚类算法中决策图的方法选取聚类中心，可以在保证聚类准确性的基础上有效降低计算复杂度.理论分析和实验测试表明：基于网格筛选的密度峰值聚类算法不仅可以对大规模数据集进行正确的聚类，还极大地降低了计算复杂度.

Abstract: Clustering by fast search and find of density peaks (density peaks clustering algorithm, DPC) is a new clustering analysis algorithm proposed in 2014. It draws decision graphs according to the theory that the cluster centers have larger local density and the distance between larger density points and cluster centers is far away. And then it finds the density peaks, as to obtain any shapes of the clusters. However, as the computation of local density and distance relies on the similarity matrix in the process of finding cluster centers, the computational complexity is relatively higher, which limits the application of DPC in large-scale datasets. In this paper, a density peaks clustering algorithm based on grid screening (SDPC) is proposed. SDPC removes the points with sparse local density based on the grid method according to the uneven distribution of datasets firstly. And then the cluster centers are selected by drawing the decision graph based on DPC, which reduces the computational complexity effectively on the basis of ensuring the clustering accuracy. Theoretical analysis and experimental results show that DPC based on grid screening can not only cluster large-scale datasets correctly, but also reduce the time complexity greatly.

HTML全文

参考文献(0)

施引文献

资源附件(0)