ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2016, Vol. 53 ›› Issue (5): 1029-1042.doi: 10.7544/issn1000-1239.2016.20148428

• 人工智能 • 上一篇    下一篇

基于概率和代表点的数据流动态聚类算法

毕安琪1,董爱美1,2,王士同1   

  1. 1(江南大学数字媒体学院 江苏无锡 214122); 2(齐鲁工业大学信息学院 济南 250353) (angela.sue.bi@gmail.com)
  • 出版日期: 2016-05-01
  • 基金资助: 
    国家自然科学基金项目(6127220);山东省高等学校科技计划项目(J14LN05);江苏省普通高校研究生科研创新计划基金项目(KYLX_1124)

A Dynamic Data Stream Clustering Algorithm Based on Probability and Exemplar

Bi Anqi1, Dong Aimei1,2, Wang Shitong1   

  1. 1(School of Digital Media, Jiangnan University, Wuxi, Jiangsu 214122); 2(School of Information, Qilu University of Technology, Jinan 250353)
  • Online: 2016-05-01

摘要: 为了解决数据流动态聚类问题,提出了一种概率化的基于代表点聚类算法.首先,基于概率框架给出了AP(affinity propagation)聚类算法和EEM(enhanced α-expansion move)聚类算法的联合目标函数,提出了概率化的基于代表点聚类算法;其次,根据样本与其代表点之间的概率,提出了基于概率的漂移动态α-expansion数据流聚类算法.该算法使得新数据的代表点尽可能贴近原始数据的代表点,从而提高聚类性能;另一方面,考虑到原始数据与新数据的相似性,该算法能够处理2种漂移过程中的动态聚类问题:1)新数据与原始数据分享部分数据,其余数据与原始数据相似;2)没有相同的数据,新数据与原始数据有相似关系.在人工合成数据集D31,Birch3以及真实数据集Forest Covertpye,KDD CUP99的实验结果均显示出了所提之算法能够处理数据流聚类问题,并保证聚类性能稳定.

关键词: 数据流, 能量函数, 概率, 优化算法, 动态聚类

Abstract: We propose an efficient probability drifting dynamic α-expansion clustering algorithm, which is designed for data stream clustering problem. In this paper, we first develop a unified target function of both affinity propagation (AP) and enhanced α-expansion move (EEM) clustering algorithms, namely the probability exemplar-based clustering algorithm. Then a probability drifting dynamic α-expansion (PDDE) clustering algorithm has been proposed considering the probability framework. The framework is capable of dealing with data stream clustering problem when current data points are similar with pervious data points. In the process of clustering, the proposed algorithm ensures that the clustering result of current data points is at least comparable well with that of previous data points. What’s more, the proposed algorithm is capable of dealing with two kinds of similarities between current and previous data points, that is whether current data points share some points with previous data points or not. Besides, experiments based on both synthetic (D31, Birch 3) and real-world dataset (Forest Covertype, KDD CUP99) have indicated the capability of PDDE in clustering data streams. The advantage of the proposed clustering algorithm in contrast to both AP and EEM algorithms has been shown as well.

Key words: data stream, energy function, probability, optimization algorithm, dynamic clustering

中图分类号: