A Density-Based Clustering Algorithm Concerning Neighborhood Balance

Wu Jiawei, Li Xiongfei, Sun Tao, and Li Wei

Abstract: Clustering is an important analytical tool in data mining, and density-based clustering is one of the approaches needed to handle very large databases. Existing density-based algorithms have difficulty with data of varying density and with indistinct cluster boundaries. After analyzing how core points and the points in their neighborhoods are distributed, this paper introduces the notions of projection point, neighborhood balance, balanceable core point, and boundary sparse point, and proposes bDBSCAN (balance-DBSCAN), a density-based clustering algorithm that improves DBSCAN by taking the neighborhood balance of core points into account. The algorithm projects the points in a core point's neighborhood, normalizes the resulting vectors to unit length, and uses them to judge whether the core point is balanceable. Only balanceable core points are expanded to form clusters: points that are balance-density-reachable from a balanceable core point are grouped into one cluster. Theoretical analysis and experimental results indicate that the algorithm discovers clusters of arbitrary shape and varied data distribution, eliminates noise such as boundary sparse points, and mitigates difficulties of clustering high-dimensional data such as indistinct boundaries between clusters and a large number of noise points, thereby improving clustering accuracy. Its time complexity is close to that of DBSCAN. The choice and impact of the algorithm's parameter are also discussed.
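The abstract does not give the formal definition of neighborhood balance, so the following is only a rough sketch of the idea under stated assumptions, not the authors' method. It assumes that "projection" means mapping each neighbor to a unit vector pointing away from the core point, and that a core point counts as balanced when the mean of those unit vectors is short. The names `imbalance`, `is_balanced_core`, and the threshold `max_imbalance` are hypothetical and do not come from the paper.

```python
import math

def neighbors(points, i, eps):
    """Indices of points within distance eps of points[i], excluding i itself."""
    return [j for j, q in enumerate(points)
            if j != i and math.dist(points[i], q) <= eps]

def imbalance(points, i, nbrs):
    """Length of the mean unit vector from points[i] to its neighbors.

    Assumed measure (not from the paper): close to 0 when neighbors surround
    the point evenly, close to 1 when they all lie on one side.
    """
    p = points[i]
    total = [0.0] * len(p)
    count = 0
    for j in nbrs:
        diff = [a - b for a, b in zip(points[j], p)]
        norm = math.hypot(*diff)
        if norm == 0.0:
            continue  # a coincident point has no direction
        for k, d in enumerate(diff):
            total[k] += d / norm  # unit vector toward the neighbor
        count += 1
    if count == 0:
        return 1.0  # no usable direction: treat as maximally unbalanced
    return math.hypot(*total) / count

def is_balanced_core(points, i, eps, min_pts, max_imbalance):
    """Core test as in DBSCAN (here min_pts counts neighbors only),
    plus the assumed balance test."""
    nbrs = neighbors(points, i, eps)
    return len(nbrs) >= min_pts and imbalance(points, i, nbrs) <= max_imbalance

# A point surrounded on all sides versus a point on the edge of a group.
interior = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
edge = [(0, 0), (1, 0), (1, 1), (1, -1)]
print(is_balanced_core(interior, 0, eps=1.5, min_pts=3, max_imbalance=0.5))
print(is_balanced_core(edge, 0, eps=1.5, min_pts=3, max_imbalance=0.5))
```

Under this assumed criterion, the edge point is dense enough to be a core point in plain DBSCAN but would not be expanded, which is one way to picture how boundary sparse points could be excluded from a cluster.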
