EDDPC: An Efficient Distributed Density Peaks Clustering Algorithm
Abstract: Cluster analysis is a widely used data mining method for analyzing relationships among data objects. It divides a set of data objects into several groups (clusters) such that the objects within a cluster are more similar to each other than to objects in other clusters. Density peaks clustering is a recently proposed clustering algorithm, published in Science, that clusters data by evaluating two attributes of each data object: its density value ρ and its δ value. Compared with traditional clustering algorithms, its advantages lie in interactivity, a non-iterative process, and the absence of assumptions about the data distribution. However, computing ρ and δ for every data object requires measuring the distance between every pair of objects, which costs O(N^2) distance computations. This limits the practicality of the algorithm when clustering high-volume, high-dimensional data sets. To improve efficiency and scalability, we propose EDDPC (efficient distributed density peaks clustering), an efficient distributed density peaks clustering algorithm that leverages Voronoi partitioning together with careful data replication and filtering to avoid a large amount of useless distance computation and data shuffle cost. Experimental results show that EDDPC improves performance significantly (up to 40x) compared with a naive MapReduce implementation.

Keywords: density peaks; data clustering; Voronoi partition; MapReduce; big data
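
To make the O(N^2) cost mentioned in the abstract concrete, the following is a minimal single-machine sketch of the naive ρ and δ computation. It is not EDDPC and not the authors' code. It assumes the definitions from the original density peaks formulation, which the abstract does not spell out: ρ is the number of other points closer than a cutoff distance, and δ is the distance to the nearest point of strictly higher density (or the largest distance for points with no denser neighbor). The function name `density_peaks_naive` and the parameter `d_c` are hypothetical names chosen for illustration.

```python
import math


def density_peaks_naive(points, d_c):
    """Naive O(N^2) computation of rho and delta for every point.

    Assumed definitions (not stated in the abstract):
      rho[i]   = number of other points closer than the cutoff d_c
      delta[i] = distance to the nearest point with strictly higher rho,
                 or the largest distance from i if no such point exists
    """
    n = len(points)
    # All pairwise distances: this is the quadratic cost the abstract refers to.
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]

    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


# Two small groups of points; d_c is a hypothetical cutoff chosen by hand.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
rho, delta = density_peaks_naive(points, d_c=1.5)
```

Points that combine a high ρ with a large δ are the candidates for cluster centers. Because every point is compared with every other point, a distributed version has to either ship all points to every worker or, as the abstract describes for EDDPC, partition the space and replicate only the points that can affect a partition's results.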