MMCKDE: m-Mixed Clustering Kernel Density Estimation over Data Streams
Abstract: Data stream mining applications place strict demands on time and space, so traditional density estimation methods such as kernel density estimation and reduced set density estimation are unsuitable for data stream density estimation because of their high computational cost and large storage requirements. To reduce time and space complexity, we propose a novel density estimation method for online data streams, m-mixed clustering kernel density estimation (MMCKDE). The method creates MMCKDE nodes, each of which uses a fixed number of mixed clustering kernels to capture cluster information in place of the full set of kernels used by other density estimation methods. As the volume of data grows, kernels are merged on the basis of the Kullback-Leibler (KL) divergence, so that the density estimate is represented in an even more compact form. Whereas some other methods can only estimate the density of the entire data stream, the model obtained by MMCKDE supports density estimation over the entire stream and over any arbitrary time interval. We compare MMCKDE with the SOMKE algorithm in terms of density estimation accuracy and running time on stationary benchmark data sets and on real-world data sets, and we also investigate the use of MMCKDE over evolving data streams. The experimental results demonstrate the effectiveness and efficiency of the proposed method, which achieves better performance in this comparison.
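The KL-based merging step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the kernel type, the exact merge criterion, or the merge rule, so everything below is an assumption made for illustration: kernels are taken to be weighted univariate Gaussians stored as `(weight, mean, variance)` tuples, the distance is a symmetrized KL divergence, the closest pair is merged greedily, and merging preserves the first two moments. The function names `kl_gauss`, `sym_kl`, `merge`, and `reduce_kernels` are hypothetical and do not come from the paper.

```python
import math


def kl_gauss(mu0, var0, mu1, var1):
    """Closed-form KL(N(mu0, var0) || N(mu1, var1)) for univariate Gaussians."""
    return 0.5 * (math.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)


def sym_kl(a, b):
    """Symmetrized KL between two (weight, mean, variance) kernels.

    Assumption: symmetrizing is one common choice; the paper may differ.
    """
    _, m0, v0 = a
    _, m1, v1 = b
    return kl_gauss(m0, v0, m1, v1) + kl_gauss(m1, v1, m0, v0)


def merge(a, b):
    """Merge two weighted Gaussian kernels, preserving total weight, mean, and variance."""
    w0, m0, v0 = a
    w1, m1, v1 = b
    w = w0 + w1
    m = (w0 * m0 + w1 * m1) / w
    v = (w0 * (v0 + (m0 - m) ** 2) + w1 * (v1 + (m1 - m) ** 2)) / w
    return (w, m, v)


def reduce_kernels(kernels, m):
    """Greedily merge the closest pair of kernels until at most m remain."""
    kernels = list(kernels)
    while len(kernels) > m:
        # Find the pair of indices with the smallest symmetrized KL divergence.
        i, j = min(
            ((p, q) for p in range(len(kernels)) for q in range(p + 1, len(kernels))),
            key=lambda pq: sym_kl(kernels[pq[0]], kernels[pq[1]]),
        )
        merged = merge(kernels[i], kernels[j])
        kernels = [k for idx, k in enumerate(kernels) if idx not in (i, j)]
        kernels.append(merged)
    return kernels


if __name__ == "__main__":
    stream_kernels = [(1.0, 0.0, 1.0), (1.0, 0.1, 1.0), (1.0, 5.0, 1.0)]
    print(reduce_kernels(stream_kernels, 2))
```

In this sketch the two kernels centered near 0 are merged first because their divergence is smallest, while the kernel at 5 is left alone. The total weight is unchanged, which keeps the compressed mixture a valid density after normalization.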