基于概率和代表点的数据流动态聚类算法

毕安琪; 董爱美; 王士同

doi:10.7544/issn1000-1239.2016.20148428

基于概率和代表点的数据流动态聚类算法

A Dynamic Data Stream Clustering Algorithm Based on Probability and Exemplar

摘要

摘要: 为了解决数据流动态聚类问题，提出了一种概率化的基于代表点聚类算法.首先，基于概率框架给出了AP(affinity propagation)聚类算法和EEM(enhanced α-expansion move)聚类算法的联合目标函数，提出了概率化的基于代表点聚类算法;其次，根据样本与其代表点之间的概率，提出了基于概率的漂移动态α-expansion数据流聚类算法.该算法使得新数据的代表点尽可能贴近原始数据的代表点，从而提高聚类性能;另一方面，考虑到原始数据与新数据的相似性，该算法能够处理2种漂移过程中的动态聚类问题：1)新数据与原始数据分享部分数据，其余数据与原始数据相似；2)没有相同的数据，新数据与原始数据有相似关系.在人工合成数据集D31,Birch3以及真实数据集Forest Covertpye,KDD CUP99的实验结果均显示出了所提之算法能够处理数据流聚类问题，并保证聚类性能稳定.

Abstract: We propose an efficient probability drifting dynamic α-expansion clustering algorithm, which is designed for data stream clustering problem. In this paper, we first develop a unified target function of both affinity propagation (AP) and enhanced α-expansion move (EEM) clustering algorithms, namely the probability exemplar-based clustering algorithm. Then a probability drifting dynamic α-expansion (PDDE) clustering algorithm has been proposed considering the probability framework. The framework is capable of dealing with data stream clustering problem when current data points are similar with pervious data points. In the process of clustering, the proposed algorithm ensures that the clustering result of current data points is at least comparable well with that of previous data points. What’s more, the proposed algorithm is capable of dealing with two kinds of similarities between current and previous data points, that is whether current data points share some points with previous data points or not. Besides, experiments based on both synthetic (D31, Birch 3) and real-world dataset (Forest Covertype, KDD CUP99) have indicated the capability of PDDE in clustering data streams. The advantage of the proposed clustering algorithm in contrast to both AP and EEM algorithms has been shown as well.

HTML全文

参考文献(0)

施引文献

资源附件(0)