Local Entropy Based Weighted Subspace Outlier Mining Algorithm

倪巍伟; 陈  耿; 陆介平; 吴英杰; 孙志挥

Abstract

Outlier detection, an important research direction in data mining, finds the small number of objects that differ markedly from the majority of a data set. As dimensionality grows, however, the spatial distribution of the data takes on unusual characteristics and distances computed over the full attribute space lose their meaning, a phenomenon known as the "curse of dimensionality". As a result, many existing outlier detection algorithms are no longer effective on high-dimensional data. To address this problem, a local-entropy-based weighted subspace outlier mining algorithm, SPOD, is proposed. For each data object, SPOD analyzes the information entropy of every attribute over the object's neighborhood and generates the object's outlier subspace and attribute weight vector; the outlier attributes that make up the outlier subspace are assigned larger weights. Concepts such as subspace weighted distance are then defined and, following the idea of density-based outlier detection, a subspace outlier influence factor is computed for each data object. The larger this factor, the more likely the object is an outlier. Theoretical analysis and experimental results indicate that SPOD is suitable for high-dimensional data sets and is both efficient and effective.
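The abstract does not give SPOD's formulas, so the following is only a minimal sketch of the general idea: per-attribute entropy over a neighborhood, a weight vector derived from it, and a weighted distance. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the equal-width binning of attributes normalized to [0, 1], the rule "entropy above the mean marks an outlier attribute", and the `boost` weight. The subspace outlier influence factor is not sketched because its definition is not stated.

```python
import math

def weighted_distance(p, q, w):
    """Euclidean distance with a per-attribute weight vector w."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, p, q)))

def neighborhood(data, i, k):
    """Indices of data[i] and its k nearest neighbours (unweighted distance)."""
    ones = [1.0] * len(data[i])
    others = sorted((weighted_distance(data[i], p, ones), j)
                    for j, p in enumerate(data) if j != i)
    return [i] + [j for _, j in others[:k]]

def local_entropy(values, bins=5):
    """Shannon entropy (bits) of values in [0, 1] over equal-width bins."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c)

def attribute_weights(data, i, k, bins=5, boost=2.0):
    """Assumed rule (not from the paper): attributes whose neighbourhood
    entropy exceeds the mean entropy form the outlier subspace and get
    weight `boost`; all other attributes get weight 1."""
    idx = neighborhood(data, i, k)
    dims = range(len(data[i]))
    ents = [local_entropy([data[j][d] for j in idx], bins) for d in dims]
    mean = sum(ents) / len(ents)
    return [boost if e > mean else 1.0 for e in ents]

# Hypothetical data: five similar points and one that deviates on attribute 1.
data = [(0.42, 0.50), (0.45, 0.52), (0.48, 0.55),
        (0.44, 0.58), (0.46, 0.51), (0.45, 0.95)]
w = attribute_weights(data, 5, k=5)
print(w, weighted_distance(data[5], data[0], w))
```

In this toy data the last point agrees with its neighbours on attribute 0 and deviates on attribute 1, so only attribute 1 receives the larger weight and dominates the weighted distance.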
