面向高维数据的低冗余top-k异常点发现方法

陈冠华; 马秀莉; 杨冬青; 唐世渭; 帅  猛; 谢昆青

面向高维数据的低冗余top-k异常点发现方法

Discovering Redundancy-Aware Top-k Anomalies in High Dimensional Data

摘要

摘要: 异常发现是数据挖掘领域的一类重要任务.针对高维对象的异常度量问题和异常点集合的冗余问题，提出了一种新的面向高维数据的异常点发现方法.该方法通过采用高维数据的二部图表示，以高维对象的压缩能力作为其异常程度的度量，能够有效支持包含不同类型属性的高维数据.为了解决top-k异常点集合中的冗余问题，提出了低冗余top-k异常点的概念.由于精确计算低冗余的top-k异常点是NP-hard问题，设计了计算近似低冗余的top-k异常点的启发式方法k-AnomaliesHD算法.从在真实和人工数据集上的实验结果可以看出，该方法具有较好的扩展性；而且与不考虑冗余的异常点发现方法相比较，能够更有效地概括数据中的异常模式.

Abstract: Discovering anomalies is an important data mining task which has been studied in many applications. In this paper, by emphasizing the problems of exception measurement of high dimensional objects and redundancy in the set of anomalies, an approach is proposed to discover the anomalies in high dimensional data. With a bipartite graph representation of the given high dimensional dataset, the capability of compression of each object is used to measure the degree of exception of the object. Based on the exception measure, the dataset containing different types of attributes, such as binary attributes, categorical attributes and numeric attributes, are well supported. To solve the problem of redundancy in the set of top-k anomalies, the concept of redundancy-aware top-k anomalies is proposed. For the problem of mining the exact set of the redundancy-aware top-k anomalies is NP-hard, an algorithm based on greedy heuristics, named k-AnomaliesHD, is designed to discover an approximate set of the redundancy-aware top-k anomalies efficiently. The experimental study both on real and synthetic datasets shows that the algorithm scales linearly with the dimensionality of the dataset and quadratic to the size of the dataset. Further, compared with the redundancy-unaware method, the set of redundancy-aware top-k anomalies is much more effective to cover the abnormal patterns of data.

HTML全文

参考文献(0)

施引文献

资源附件(0)