ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development (计算机研究与发展), 2014, Vol. 51, Issue (10): 2277-2294. doi: 10.7544/issn1000-1239.2014.20130718

• Artificial Intelligence •

MMCKDE: m-Mixed Clustering Kernel Probability Density Estimation over Data Streams

Xu Min (许敏)1,2, Deng Zhaohong (邓赵红)1, Wang Shitong (王士同)1, Shi Yingzhong (史荧中)1,2

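  1. School of Digital Media, Jiangnan University, Wuxi, Jiangsu 214122
  2. School of Internet of Things Technology, Wuxi Institute of Technology, Wuxi, Jiangsu 214121 (xum@wxit.edu.cn)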
  • Published: 2014-10-01
  • Funding:
    National Natural Science Foundation of China (61271368)

MMCKDE: m-Mixed Clustering Kernel Density Estimation over Data Streams

Xu Min1,2, Deng Zhaohong1, Wang Shitong1, Shi Yingzhong1,2   

  1. School of Digital Media, Jiangnan University, Wuxi, Jiangsu 214122
  2. School of Internet of Things Technology, Wuxi Institute of Technology, Wuxi, Jiangsu 214121
  • Online: 2014-10-01

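Abstract (Chinese-language version): Data stream mining applications place heavy demands on time and space, so traditional density estimation methods such as kernel density estimation and reduced set density estimation are not suitable for density estimation over data streams. This paper proposes a novel m-mixed clustering kernel density estimation (MMCKDE) method for online data streams. The method creates MMCKDE nodes and uses a fixed number of mixed clustering kernels to capture cluster information, in place of the full set of kernels used by other density estimation methods. As the volume of data keeps growing, kernels are merged by computing the Kullback-Leibler (KL) distance, which represents the density estimate in an even more compact form. Whereas some other methods can only estimate the density of the entire stream, the model produced by MMCKDE supports density estimation over the entire stream and over any time interval. MMCKDE is compared with the SOMKE algorithm on several benchmark and real-world data sets in terms of density estimation accuracy and running time. The experimental results show that MMCKDE performs better.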

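Keywords (Chinese-language version): m-mixed clustering kernel, kernel density estimation, probability density function, Kullback-Leibler distance, stream data mining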

Abstract: In many data stream mining applications, traditional density estimation methods such as kernel density estimation and reduced set density estimation cannot be applied to data streams because of their high computational cost and large storage requirements. To reduce time and space complexity, a novel online density estimation method for data streams based on m-mixed clustering kernels is proposed. The method creates MMCKDE nodes that use a fixed number of mixed clustering kernels to capture cluster information, instead of the full set of kernels required by other density estimation methods. To further reduce storage, MMCKDE nodes can be merged by calculating the KL divergence. The resulting model can estimate the probability density function over the entire stream or over an arbitrary time interval. We compare the MMCKDE algorithm with the SOMKE algorithm in terms of density estimation accuracy and running time on various stationary data sets, and we also investigate the use of MMCKDE over evolving data streams. The experimental results illustrate the effectiveness and efficiency of the proposed method.
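The abstract does not give the algorithm's details, so the following Python is only a minimal sketch of the general idea of keeping a fixed budget of kernels and merging by KL divergence. It is not the authors' MMCKDE procedure. Everything in it is an assumption made for illustration: one-dimensional Gaussian kernels, a symmetric KL divergence as the merge criterion, a moment-preserving merge rule, and the hypothetical names `kl_gauss`, `merge_pair`, `compress`, and `density`.

```python
import math
from itertools import combinations


def kl_gauss(m0, v0, m1, v1):
    """KL(N(m0, v0) || N(m1, v1)) for 1-D Gaussians given means and variances."""
    return 0.5 * (math.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)


def merge_pair(a, b):
    """Merge two weighted Gaussians (weight, mean, variance), preserving
    the total weight, mean, and variance of the pair (illustrative rule)."""
    (wa, ma, va), (wb, mb, vb) = a, b
    w = wa + wb
    m = (wa * ma + wb * mb) / w
    v = (wa * (va + (ma - m) ** 2) + wb * (vb + (mb - m) ** 2)) / w
    return (w, m, v)


def compress(kernels, budget):
    """Repeatedly merge the pair with the smallest symmetric KL divergence
    until at most `budget` kernels remain."""
    kernels = list(kernels)

    def sym_kl(pair):
        (_, ma, va), (_, mb, vb) = kernels[pair[0]], kernels[pair[1]]
        return kl_gauss(ma, va, mb, vb) + kl_gauss(mb, vb, ma, va)

    while len(kernels) > budget:
        i, j = min(combinations(range(len(kernels)), 2), key=sym_kl)
        merged = merge_pair(kernels[i], kernels[j])
        # Keep the untouched kernels and append the merged one at the end.
        kernels = [k for n, k in enumerate(kernels) if n not in (i, j)]
        kernels.append(merged)
    return kernels


def density(kernels, x):
    """Evaluate the weighted Gaussian mixture at x, normalising the weights."""
    total = sum(w for w, _, _ in kernels)
    return sum(
        w * math.exp(-((x - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
        for w, m, v in kernels
    ) / total
```

In a streaming setting, each arriving sample `x` could be added as a kernel `(1.0, x, h ** 2)` for some assumed bandwidth `h`, followed by a call to `compress` so that memory stays bounded by the budget.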

Key words: m-mixed clustering kernel, kernel density estimation, probability density functions, Kullback Leibler (KL) divergence, streaming data mining

CLC number: