ISSN 1000-1239 CN 11-1777/TP

Distance-Based Outlier Detection on Uncertain Data (基于距离的不确定离群点检测)

Yu Hao (于浩)1, Wang Bin (王斌)1, Xiao Gang (肖刚)1, and Yang Xiaochun (杨晓春)1,2

  1. College of Information Science and Engineering, Northeastern University, Shenyang 110004
  2. Key Laboratory of Data Engineering and Knowledge Engineering for the Ministry of Education, Renmin University of China, Beijing 100872
  (yangxc@mail.neu.edu.cn)
  • Published: 2010-03-15
  • Online: 2010-03-15

Abstract: Outlier detection is a valuable technique in applications such as network intrusion detection and abnormal-event detection in wireless sensor networks (WSNs). It has been studied in depth on deterministic data, but it remains a new research problem on emerging uncertain data. In wireless sensor networks, data integration, and data mining, an uncertain data model reflects the real world more faithfully and makes these techniques more practical. This paper proposes a new definition of outliers on uncertain data. Based on this definition, it proposes efficient filtering approaches for distance-based outlier detection on uncertain data, namely a basic approach, b-RFA, and an improved approach, o-RFA, together with an efficient probability computation approach, DPA. b-RFA exploits a filtering property of non-outliers to reduce the number of detections. o-RFA improves on b-RFA by mining information about the data distribution, which further raises filtering efficiency. DPA identifies a recurrence in the probability computation, which greatly improves the efficiency of detecting a single object. Experimental results show that the proposed approaches effectively prune the candidate set, shrink the search space, and improve query performance on uncertain data.

Key words: uncertain data, outlier detection, pruning method, efficiency, uncertain data model
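
The abstract does not give the outlier definition or the recurrence that DPA uses, so the following is only an illustrative sketch of the general idea, not the paper's algorithm. It assumes a common model that may differ from the paper's: each object exists independently with a known probability, and an object's outlier score is the probability that fewer than `k` of the other objects within distance `d` exist. The names `prob_fewer_than_k` and `outlier_probability` and the parameters `d` and `k` are hypothetical. The recurrence processes one neighbor at a time, so a single object is scored in O(n·k) time instead of enumerating all 2^n possible worlds.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def prob_fewer_than_k(probs, k):
    """Probability that fewer than k of the independent events occur.

    probs: existence probabilities of the neighbors (assumed independent).
    """
    if k <= 0:
        return 0.0
    # counts[j] = probability that exactly j of the neighbors seen so far exist.
    # Only j < k is tracked; mass that reaches k or more is simply dropped.
    counts = [1.0] + [0.0] * (k - 1)
    for p in probs:
        # Update from high j to low j so counts[j - 1] is still the old value.
        for j in range(k - 1, 0, -1):
            counts[j] = counts[j] * (1.0 - p) + counts[j - 1] * p
        counts[0] *= 1.0 - p
    return sum(counts)


def outlier_probability(points, i, d, k):
    """Assumed score: P(object i has fewer than k existing neighbors within d).

    points: list of (coordinates, existence_probability) pairs.
    """
    center = points[i][0]
    neighbor_probs = [
        p for j, (coords, p) in enumerate(points)
        if j != i and dist(center, coords) <= d
    ]
    return prob_fewer_than_k(neighbor_probs, k)


if __name__ == "__main__":
    data = [((0, 0), 1.0), ((1, 0), 0.5), ((0, 1), 0.5), ((10, 10), 0.9)]
    for idx in range(len(data)):
        print(idx, outlier_probability(data, idx, d=2.0, k=2))
```

Under these assumptions, an object would be reported as an outlier when its score exceeds a user-chosen threshold. A filtering step in the spirit of b-RFA would aim to skip this computation for objects that can be shown to be non-outliers by cheaper means.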