Abstract:
Outlier mining, which seeks exceptional objects that deviate from the rest of a data set, has become an active topic in data mining. As dimensionality increases, however, the data can exhibit unusual characteristics, for example in their spatial distribution, and distances measured over the full attribute space are no longer meaningful. This phenomenon, known as the “curse of dimensionality”, degrades the effectiveness of many existing outlier detection algorithms. To address this problem, a local-entropy-based weighted subspace outlier mining algorithm, SPOD, is proposed. SPOD generates an outlier subspace and a weighted attribute vector for each data object by analyzing the entropy of each attribute over that object’s neighborhood. For a given data object, the outlier attributes, which constitute the object’s outlier subspace, are assigned larger weights. Definitions such as the subspace weighted distance are then introduced to perform density-based outlier processing on the data set and to obtain a subspace outlier influence factor for each data point. The larger this factor, the more likely the corresponding data point is an outlier. Theoretical analysis and experimental results show that SPOD is suitable for high-dimensional data sets and is both efficient and effective.
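
The idea of weighting attributes by their local entropy can be illustrated with a minimal sketch. It is not the SPOD algorithm itself: the abstract does not give the formal definitions, so the equal-width binning, the "above mean entropy" rule for choosing outlier attributes, the weight values `big` and `small`, and all function names below are assumptions made purely for illustration.

```python
import math

def attribute_entropy(values, bins=4):
    """Shannon entropy (in bits) of one attribute over a neighborhood.
    Equal-width binning is an assumption, not taken from the paper."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant attribute carries no uncertainty
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def k_nearest(data, i, k):
    """Indices of the k nearest neighbors of data[i] (full-space Euclidean)."""
    others = (j for j in range(len(data)) if j != i)
    return sorted(others, key=lambda j: math.dist(data[i], data[j]))[:k]

def attribute_weights(data, i, k, big=2.0, small=1.0):
    """Hypothetical rule: attributes whose local entropy exceeds the mean
    entropy are treated as outlier attributes and get the bigger weight."""
    hood = [data[j] for j in k_nearest(data, i, k)] + [data[i]]
    ents = [attribute_entropy([p[d] for p in hood])
            for d in range(len(data[i]))]
    mean = sum(ents) / len(ents)
    return [big if e > mean else small for e in ents]

def weighted_distance(p, q, w):
    """Euclidean distance with one weight per attribute."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, p, q)))
```

In this sketch each object gets its own weight vector, so the distance used in a subsequent density-based step would differ from object to object. How SPOD actually selects the outlier subspace and derives the influence factor is defined in the body of the paper, not here.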