Abstract:
Discovering anomalies is an important data mining task which has been studied in many applications. In this paper, by emphasizing the problems of exception measurement of high dimensional objects and redundancy in the set of anomalies, an approach is proposed to discover the anomalies in high dimensional data. With a bipartite graph representation of the given high dimensional dataset, the capability of compression of each object is used to measure the degree of exception of the object. Based on the exception measure, the dataset containing different types of attributes, such as binary attributes, categorical attributes and numeric attributes, are well supported. To solve the problem of redundancy in the set of top-k anomalies, the concept of redundancy-aware top-k anomalies is proposed. For the problem of mining the exact set of the redundancy-aware top-k anomalies is NP-hard, an algorithm based on greedy heuristics, named k-AnomaliesHD, is designed to discover an approximate set of the redundancy-aware top-k anomalies efficiently. The experimental study both on real and synthetic datasets shows that the algorithm scales linearly with the dimensionality of the dataset and quadratic to the size of the dataset. Further, compared with the redundancy-unaware method, the set of redundancy-aware top-k anomalies is much more effective to cover the abnormal patterns of data.