An Efficient Data Stream Outlier Detection Algorithm Based on k-Means Partitioning

Abstract: Outlier detection is an important topic in data mining. Detecting outliers in data streams is particularly difficult because streams are dynamic, can be read only once, and carry large volumes of data. This paper proposes DSOKP, a data stream outlier detection algorithm based on k-means partitioning. DSOKP first partitions the data stream and applies k-means clustering to each partition, producing an intermediate clustering result (a set of mean reference points). It then identifies the potential outliers of each period among these mean reference points, according to the definition of an outlier. Theoretical analysis and experimental results indicate that DSOKP is effective and efficient.
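The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the outlier definition, the partition size, or how k-means is seeded, so everything here is an assumption chosen for illustration. In particular, the function names (`partitions`, `kmeans_reference_points`, `detect_outliers`), the evenly spaced seeding, and the distance-based rule (a mean reference point is flagged when the points represented by reference points within `radius` of it make up less than `min_fraction` of the partition) are hypothetical.

```python
import math
from typing import Iterable, Iterator, List, Sequence, Tuple

Point = Tuple[float, ...]
RefPoint = Tuple[Point, int]  # (cluster mean, number of points in the cluster)


def partitions(stream: Iterable[Point], size: int) -> Iterator[List[Point]]:
    """Cut the stream into consecutive fixed-size partitions (single pass)."""
    batch: List[Point] = []
    for p in stream:
        batch.append(p)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # last, possibly shorter, partition
        yield batch


def kmeans_reference_points(points: Sequence[Point], k: int,
                            iters: int = 20) -> List[RefPoint]:
    """Lloyd's k-means on one partition; returns the weighted cluster means."""
    k = min(k, len(points))
    # Assumption: deterministic seeding with evenly spaced points.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each mean; an empty cluster keeps its old centroid.
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return [(centroids[j], len(clusters[j])) for j in range(k) if clusters[j]]


def detect_outliers(stream: Iterable[Point], size: int, k: int,
                    radius: float, min_fraction: float
                    ) -> Iterator[Tuple[int, Point]]:
    """Yield (partition index, reference point) for each flagged reference point."""
    for index, batch in enumerate(partitions(stream, size)):
        refs = kmeans_reference_points(batch, k)
        for mean, _ in refs:
            # Assumed rule: total weight of reference points near this one.
            support = sum(w for other, w in refs
                          if math.dist(mean, other) <= radius)
            if support < min_fraction * len(batch):
                yield index, mean


if __name__ == "__main__":
    demo = [(0.0,), (0.1,), (0.2,), (0.3,), (5.0,),
            (5.1,), (5.2,), (5.3,), (5.4,), (50.0,)]
    for index, ref in detect_outliers(demo, size=10, k=3,
                                      radius=1.0, min_fraction=0.2):
        print(index, ref)
```

Because outliers are sought only among the weighted reference points, each partition's raw data can be discarded once it has been clustered, which is what makes a single pass over the stream sufficient in this sketch.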