Incremental Outlier Mining Algorithm for Massive Data Based on Grid and Density

ZHANG Jing; SUN Zhihui; YANG Ming; NI Weiwei; YANG Yidong

Fast Incremental Outlier Mining Algorithm Based on Grid and Capacity

Abstract

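Summary: Handling massive, high-dimensional data has become a major task and challenge in the design of outlier detection algorithms. Targeting the characteristics of massive data, this paper proposes IGDLOF, an incremental outlier mining algorithm based on grid and density. The basic idea is to use a seven-tuple of information per grid to reduce the dimensionality and volume of the data, and to use incremental updates to reduce memory requirements. Representative points filter out the corresponding main body of the data, and a judge-first, then compute-approximate-density approach reduces the amount of computation and lowers the algorithm's complexity. Tests on real and simulated datasets show that the incremental IGDLOF algorithm maintains the same accuracy as the LOF algorithm while significantly improving execution efficiency.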

Abstract: Outlier mining is an important branch of data mining. It has been widely applied in industrial and financial settings, for example in intrusion detection systems (IDS) and credit card fraud detection. Handling massive, high-dimensional data has become a central task and challenge for outlier detection algorithms. Based on definitions of density and grid, a fast incremental outlier mining algorithm is proposed. It introduces a seven-tuple of grid information to reduce the number and dimensionality of the data, and uses incremental updates to reduce memory requirements. Dense grids, sparse grids, and neighbor grids are defined so that computation can operate conveniently on grids. Appropriate representative points are used to filter the main body of the data, and an approximate density computation is adopted to reduce computation and lower the algorithm's complexity. Experiments are performed on different initial and incremental datasets, and the results report detection rate, false alarm rate, precision, and average running time. Tests on real and simulated datasets show that the proposed algorithm maintains the same accuracy as the LOF algorithm while significantly improving execution efficiency.
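The grid idea described in the abstract can be illustrated with a minimal sketch. It is not the IGDLOF algorithm itself: the abstract does not list the fields of the seven-tuple, so the `CellSummary` fields below (a point count and a per-dimension linear sum) are hypothetical stand-ins, and `GridIndex`, `cell_width`, and `dense_threshold` are names assumed only for illustration. The sketch shows how per-grid summaries can be updated batch by batch without keeping the raw points, and how sparse grids can be singled out as the places where a more expensive density check would be worth running.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Sequence, Tuple

Cell = Tuple[int, ...]


@dataclass
class CellSummary:
    """Hypothetical per-grid summary; NOT the paper's seven-tuple."""
    count: int = 0
    linear_sum: List[float] = field(default_factory=list)

    def add(self, point: Sequence[float]) -> None:
        # Lazily size the running sum to the data dimensionality.
        if not self.linear_sum:
            self.linear_sum = [0.0] * len(point)
        self.count += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x

    def centroid(self) -> List[float]:
        # A simple representative point for the grid.
        return [s / self.count for s in self.linear_sum]


class GridIndex:
    """Keeps only per-grid summaries, so raw points can be discarded."""

    def __init__(self, cell_width: float, dense_threshold: int) -> None:
        self.cell_width = cell_width            # assumed uniform grid width
        self.dense_threshold = dense_threshold  # assumed density cut-off
        self.cells: Dict[Cell, CellSummary] = defaultdict(CellSummary)

    def cell_of(self, point: Sequence[float]) -> Cell:
        return tuple(int(x // self.cell_width) for x in point)

    def insert_batch(self, points: Iterable[Sequence[float]]) -> None:
        # Incremental update: each new batch only touches the grids it hits.
        for p in points:
            self.cells[self.cell_of(p)].add(p)

    def is_dense(self, cell: Cell) -> bool:
        return cell in self.cells and self.cells[cell].count >= self.dense_threshold

    def sparse_cells(self) -> List[Cell]:
        # Points in sparse grids are the outlier candidates in this sketch.
        return sorted(c for c, s in self.cells.items()
                      if s.count < self.dense_threshold)


grid = GridIndex(cell_width=1.0, dense_threshold=3)
grid.insert_batch([(0.1, 0.2), (0.4, 0.9), (0.5, 0.5), (5.2, 5.1)])  # initial data
grid.insert_batch([(0.3, 0.3), (5.8, 5.9)])                          # increment
print(grid.sparse_cells())
```

In this toy run the grid around the origin accumulates four points and is treated as dense, while the grid holding the two distant points stays below the threshold and is the only one flagged for closer inspection. A real implementation would also consult neighbor grids before deciding, as the abstract indicates.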
