ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2016, Vol. 53 ›› Issue (5): 1063-1071. doi: 10.7544/issn1000-1239.2016.20148422


A Clustering Algorithm for Large-Scale Categorical Data and Its Parallel Implementation

Ding Xiangwu1, Guo Tao1, Wang Mei1, Jin Ran2   

  1. (College of Computer Science and Technology, Donghua University, Shanghai 201620); 2. (Faculty of Computer Science and Information Technology, Zhejiang Wanli University, Ningbo, Zhejiang 315100)
  • Online:2016-05-01

Abstract: The CLOPE algorithm performs well when clustering large, sparse, high-dimensional categorical datasets. However, its result depends on the order of the input data, so it cannot stably find the globally optimal clustering. To address this problem, this paper proposes the p-CLOPE algorithm, which iteratively divides the input data into multiple equal parts and clusters different permutations of those parts. In each iteration of p-CLOPE, the input dataset is split into p parts, which are permuted to form p! datasets with different part orders; each dataset is then clustered, and the clustering with the highest profit is chosen as the input to the next iteration. To reduce the time complexity of this process, a result-reuse strategy is proposed that further speeds up clustering. Finally, a distributed solution that implements p-CLOPE on the Hadoop platform is presented, and a clustering tool has been developed and released to the open source community. Experiments show that p-CLOPE achieves better results than CLOPE. For the Mushroom dataset, when CLOPE achieves its optimal result, p-CLOPE achieves a 357% higher profit value than CLOPE. When dealing with big data, parallel p-CLOPE greatly shortens computing time compared with serial p-CLOPE, and it achieves a speedup of nearly p! when sufficient computing resources are available.
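
The split, permute, cluster, and select loop described in the abstract can be illustrated with a minimal serial sketch. It is not the authors' implementation. It assumes the standard CLOPE profit, sum(S_i * N_i / W_i^r) / sum(N_i), where S_i is the item-occurrence count, W_i the number of distinct items, and N_i the number of transactions in cluster i, and r is the repulsion parameter. It also assumes a simplified one-pass CLOPE without the refinement phase, and assumes that the best-ordered dataset becomes the next iteration's input. The result-reuse strategy and the Hadoop implementation are omitted, and all function names are hypothetical.

```python
from collections import Counter
from itertools import permutations

def delta_add(cluster, txn, r):
    """Change in S*N/W^r if txn (a non-empty set of items) joins cluster."""
    s_new = cluster["S"] + len(txn)
    # Reading a missing Counter key returns 0 without inserting it.
    w_new = len(cluster["occ"]) + sum(1 for it in txn if cluster["occ"][it] == 0)
    new = s_new * (cluster["N"] + 1) / (w_new ** r)
    old = cluster["S"] * cluster["N"] / (len(cluster["occ"]) ** r) if cluster["N"] else 0.0
    return new - old

def clope_single_pass(transactions, r=2.0):
    """Simplified CLOPE: one greedy pass, no iterative refinement (assumption)."""
    clusters, labels = [], []
    for txn in transactions:
        empty = {"occ": Counter(), "S": 0, "N": 0}
        candidates = clusters + [empty]
        best = max(range(len(candidates)),
                   key=lambda i: delta_add(candidates[i], txn, r))
        if best == len(clusters):  # opening a new cluster gives the largest gain
            clusters.append(empty)
        c = clusters[best]
        c["occ"].update(txn)
        c["S"] += len(txn)
        c["N"] += 1
        labels.append(best)
    return clusters, labels

def profit(clusters, r=2.0):
    """CLOPE profit: sum(S*N/W^r) over clusters divided by total transactions."""
    total = sum(c["N"] for c in clusters)
    return sum(c["S"] * c["N"] / (len(c["occ"]) ** r) for c in clusters) / total

def split(data, p):
    """Split data into p contiguous parts whose sizes differ by at most one."""
    k, m = divmod(len(data), p)
    parts, start = [], 0
    for i in range(p):
        end = start + k + (1 if i < m else 0)
        parts.append(data[start:end])
        start = end
    return parts

def p_clope(data, p=3, r=2.0, iterations=2):
    """Try all p! part orders per iteration and keep the most profitable order."""
    best_order = list(data)
    best_profit = profit(clope_single_pass(best_order, r)[0], r)
    for _ in range(iterations):
        parts = split(best_order, p)
        for perm in permutations(parts):  # p! independent clustering runs
            candidate = [t for part in perm for t in part]
            value = profit(clope_single_pass(candidate, r)[0], r)
            if value > best_profit:
                best_order, best_profit = candidate, value
    return best_order, best_profit
```

Because the p! clustering runs inside one iteration do not depend on each other, they are natural units of parallel work, which is consistent with the near-p! speedup the abstract reports when enough computing resources are available.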

Key words: categorical data, CLOPE, p-CLOPE, parallel clustering, MapReduce
