ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2016, Vol. 53 ›› Issue (6): 1400-1409.doi: 10.7544/issn1000-1239.2016.20150616

Previous Articles     Next Articles

EDDPC: An Efficient Distributed Density Peaks Clustering Algorithm

Gong Shufeng Zhang Yanfeng   

  1. (College of Computer Science and Engineering, Northeastern University, Shenyang 110819)
  • Online:2016-06-01

Abstract: Clustering is a commonly used method for data relationship analytics in data mining. The clustering algorithm divides a set of objects into several groups (clusters), and the data objects in the same group are more similar to each other than to those in other groups. Density peaks clustering is a recently proposed clustering algorithm published in Science magazine, which performs clustering in terms of each data object’s ρ value and δ value. It exhibits its superiority over the other traditional clustering algorithms in interactivity, non-iterative process, and non-assumption on data distribution. However, computing each data object’s ρ and δ value requires to measure distance between any pair of objects with high computational cost of O(N\+2). This limits the practicability of this algorithm when clustering high-volume and high-dimensional data set. In order to improve efficiency and scalability, we propose an efficient distributed density peaks clustering algorithm—EDDPC, which leverages Voronoi diagram and careful data replicationfiltering to reduce huge amount of useless distance measurement cost and data shuffle cost. Our results show that our EDDPC algorithm can improve the performance significantly (up to 40x) compared with naive MapReduce implementation.

Key words: density peaks, data clustering, Voronoi partition, MapReduce, big data

CLC Number: