Abstract:
With the rapid development of computer technology, data grows explosively. There are challenges for the traditional machine learning algorithms to deal with the large scale data. Many parallel algorithms have been proposed to address the scalability problem, such as MapReduce-based K-means algorithm and parallel spectral clustering algorithm. Affinity propagation (AP) clustering algorithm is introduced to address some drawbacks of the traditional clustering methods such as K-means algorithm. However, its scalability and performance still need improving when dealing with large scale data. In this paper, we propose a distributed AP clustering algorithm based on MapReduce, named DisAP. At first, large scale data are partitioned into several smaller subsets randomly. Then each subset is sparsified in parallel by using AP clustering algorithm. The results are fused and then clustered again, which forms a set of high-quality exemplars. Finally, all data are assigned to exemplars in parallel. DisAP is implemented on a Hadoop cluster, and the experiments on synthetic datasets,human face image datasets, and IRIS dataset demonstrate that DisAP can achieve high performance on both scalability and accuracy.