A Clustering Algorithm for Large-Scale Categorical Data and Its Parallel Implementation

Ding Xiangwu (丁祥武); Guo Tao (郭涛); Wang Mei (王梅); Jin Ran (金冉)

doi:10.7544/issn1000-1239.2016.20148422

Abstract

The CLOPE algorithm achieves good results when clustering large, sparse, high-dimensional categorical datasets. However, its output depends on the order of the input data, so it is hard to obtain a stable, globally optimal clustering. To address this weakness, this paper proposes the p-CLOPE algorithm, which is based on splitting the input into equal parts and permuting them. In each iteration of p-CLOPE, the input dataset is split into p equal parts, and the parts are permuted to generate p! datasets with different part orders; each of these datasets is clustered, and the clustering with the best profit is taken as the input to the next iteration. To reduce the time complexity of this process, an intermediate-result reuse strategy is proposed that considerably speeds up clustering. Finally, an open-source clustering tool containing p-CLOPE and related algorithms is implemented on the Hadoop platform. Experiments show that p-CLOPE achieves better clustering results than CLOPE. On the Mushroom dataset, when CLOPE achieves its optimal clustering result, p-CLOPE attains a profit value 35.7% higher than CLOPE's. When processing large amounts of data, parallel p-CLOPE greatly shortens clustering time compared with serial p-CLOPE and, when computing resources are sufficient, achieves a speedup close to p!.
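One round of the split-and-permute idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Cluster`, `clope`, and `p_clope_round` are hypothetical, the CLOPE routine is a simplified single-pass version that uses the standard CLOPE profit (the sum of S·N/W^r over clusters, divided by the total number of transactions), the repulsion value `r=2.0` is an arbitrary default, transactions are assumed to be non-empty tuples of distinct items, and the intermediate-result reuse strategy and the Hadoop parallelization are not attempted.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple

Transaction = Sequence[str]  # assumption: non-empty, items distinct


class Cluster:
    """Per-cluster statistics used by CLOPE: N, S, and item occurrences."""

    def __init__(self) -> None:
        self.n = 0                     # N: number of transactions
        self.size = 0                  # S: total item occurrences
        self.occ: Dict[str, int] = {}  # item -> count; W = len(occ)

    def add(self, t: Transaction) -> None:
        self.n += 1
        self.size += len(t)
        for item in t:
            self.occ[item] = self.occ.get(item, 0) + 1

    def delta_add(self, t: Transaction, r: float) -> float:
        """Change in S*N/W^r if transaction t were added to this cluster."""
        new_w = len(self.occ) + sum(1 for item in t if item not in self.occ)
        new_val = (self.size + len(t)) * (self.n + 1) / new_w ** r
        old_val = self.size * self.n / len(self.occ) ** r if self.n else 0.0
        return new_val - old_val


def clope(transactions: Sequence[Transaction], r: float = 2.0) -> Tuple[List[int], float]:
    """Simplified single-pass CLOPE: returns (cluster labels, profit)."""
    if not transactions:
        return [], 0.0
    clusters: List[Cluster] = []
    labels: List[int] = []
    for t in transactions:
        # Candidates are all existing clusters plus one fresh, empty cluster.
        candidates = clusters + [Cluster()]
        best = max(range(len(candidates)),
                   key=lambda i: candidates[i].delta_add(t, r))
        if best == len(clusters):
            clusters.append(candidates[best])
        clusters[best].add(t)
        labels.append(best)
    total = sum(c.n for c in clusters)
    profit = sum(c.size * c.n / len(c.occ) ** r for c in clusters) / total
    return labels, profit


def p_clope_round(transactions: Sequence[Transaction], p: int,
                  r: float = 2.0) -> Tuple[List[Transaction], float]:
    """Split into p near-equal parts, cluster all p! part orders, keep the best."""
    n = len(transactions)
    bounds = [n * k // p for k in range(p + 1)]
    parts = [list(transactions[bounds[k]:bounds[k + 1]]) for k in range(p)]
    best_data: List[Transaction] = list(transactions)
    best_profit = float("-inf")
    for order in permutations(range(p)):
        data = [t for k in order for t in parts[k]]
        _, profit = clope(data, r)
        if profit > best_profit:
            best_data, best_profit = data, profit
    return best_data, best_profit
```

Because the identity ordering is one of the p! permutations, the profit returned by `p_clope_round` is never lower than that of running `clope` once on the original order. The p! clusterings are independent of one another, which is consistent with the near-p! speedup the abstract reports for the parallel version when resources are sufficient.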
