A Distributed k-Means Clustering Algorithm Based on the Vector Inner-Product Inequality

An Effective Distributed k-Means Clustering Algorithm Based on the Pretreatment of Vectors' Inner-Product

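Abstract: Cluster analysis is an important research topic in data mining. As data volumes grow rapidly, clustering large data sets has become a difficult problem. Although the k-means algorithm is easy to implement and its complexity is linear in the size of the data set, it is still inefficient when applied to large data sets. Distributed clustering is an effective way to address this problem. Building on the existing distributed clustering algorithm k-DMeans, this paper optimizes the algorithm using the vector inner-product inequality and proposes the distributed clustering algorithm k-DCBIP. Theoretical analysis and experimental results show that k-DCBIP outperforms k-DMeans and can effectively solve the clustering problem for large data sets; the algorithm is effective and feasible.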

Abstract: Clustering is an important research topic in data mining. As data accumulate, clustering large data sets has become a hard problem. Despite its simplicity and linear time complexity, the serial k-Means algorithm remains expensive when applied to a large data set. Distributed clustering is an effective way to solve this problem. In this paper, the vector inner-product inequality is used to improve the efficiency of the existing parallel k-Means algorithm (k-DMeans), and an effective distributed k-Means clustering algorithm, k-DCBIP, is proposed. Theoretical analysis and experimental results show that k-DCBIP outperforms k-DMeans and that it is both effective and efficient.