Abstract:
CLOPE algorithm has achieved good results in clustering large, sparse categorical datasets with high dimensions. However, it is hard to stably find the global optimal clusters since the data order can affect the result of clustering. To deal with this problem, this paper proposes p-CLOPE algorithm iteratively dividing input data into multiply equal parts and then clustering their different permutations. In each iteration of p-CLOPE algorithm, the input dataset is split into p parts and they are permuted into p! datasets with different part orders, then each dataset is clustered and the optimal clustering is chosen according to the profit as the input of next iterations. In order to handle time complexity of the process, a result reusing strategy is put forward that can improve the speed of clustering, further. Finaly, a distributed solution is put forward that implements p-CLOPE on Hadoop platform and a clustering tool is developed which has been released to the open source community. Experiments show that p-CLOPE can achieve better results than CLOPE. For the Mushroom dataset, when CLOPE achieves optimal results, p-CLOPE can achieve 357% higher profit value than CLOPE. When dealing with big data, parallel p-CLOPE greatly shortens the computing time compared with serial p-CLOPE, and it achieves nearly p! speedup when there is enough computing resource.