A Data Stream Subspace Clustering Algorithm Based on Region Partition

Yu Xiang; Yin Guisheng; Xu Xiandong; Wang Jianwei

Abstract

The main aim of data stream subspace clustering is to accurately find the clusters in the feature subspaces of a data stream within a reasonable time. Existing data stream subspace clustering algorithms are strongly influenced by their parameters: they usually require the number of clusters or the feature subspace to be specified in advance, and their clustering results cannot promptly reflect changes in the data stream. To address these shortcomings, we propose a new data stream subspace clustering algorithm, SC-RP. SC-RP does not require the number of clusters or the feature subspace to be given in advance, is insensitive to outliers, and clusters quickly. It records changes in the data stream in a data structure called the Region-tree, updates the corresponding statistics promptly, and then adjusts the clustering results to follow those changes. Experiments on real and synthetic datasets show that SC-RP outperforms existing data stream subspace clustering algorithms in both clustering precision and speed, and that it scales well with both the number of clusters and the number of dimensions.