Concept Drift Detection for Data Streams Based on Mixture Model

Guo Gongde, Li Nan, and Chen Lifei

Abstract

Classification of data streams with concept drift has attracted growing attention because of its applications in credit card fraud detection and many other fields. Most existing algorithms assume that the true label of each test instance becomes available right after it is classified, and they use the true labels of all such instances to detect concept drift and adjust the current model. This assumption is impractical in real-world settings, because manually labeling instances that arrive continuously at high speed requires a great deal of time and effort. To address this problem, this paper proposes KnnM-IB, a concept drift detection method based on the KNNModel algorithm and the incremental Bayes algorithm. The proposed method retains the high accuracy and speed of KNNModel when classifying instances covered by the model clusters, and uses the incremental Bayes algorithm to classify the hard-to-handle (confused) instances and update the model, thereby maintaining classification quality. KnnM-IB detects concept drift using changes in the size of a variable sliding window together with a small number of the most informative instances, which are selected and labeled through active learning. When the underlying concept of the data stream is stable, semi-supervised learning is used to enlarge the set of labeled instances for updating the model, which better matches the requirements of practical applications. Experimental results show that, compared with traditional classification algorithms, the proposed method adapts to concept drift, updates the model accordingly, and achieves comparable or better classification accuracy.
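The two-stage classification idea described in the abstract (cluster-covered instances are labeled by the covering cluster, and the remaining instances are passed to an incremental Bayes classifier) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `Cluster`, `IncrementalGaussianNB`, and `classify` are hypothetical, clusters are assumed to be labeled hyperspheres, the Bayes model is assumed to use Gaussian likelihoods with a small variance-smoothing constant, and the drift detection, active learning, and semi-supervised steps are not shown.

```python
import math
from collections import defaultdict


class Cluster:
    """Hypothetical model cluster: a labeled hypersphere (center, radius, label)."""

    def __init__(self, center, radius, label):
        self.center = center
        self.radius = radius
        self.label = label

    def covers(self, x):
        # An instance is covered if it lies inside the hypersphere.
        return math.dist(self.center, x) <= self.radius


class IncrementalGaussianNB:
    """Gaussian naive Bayes updated one instance at a time (Welford's method)."""

    def __init__(self):
        self.n = defaultdict(int)  # per-class instance counts
        self.mean = {}             # per-class running feature means
        self.m2 = {}               # per-class running sums of squared deviations

    def update(self, x, y):
        if y not in self.mean:
            self.mean[y] = [0.0] * len(x)
            self.m2[y] = [0.0] * len(x)
        self.n[y] += 1
        for i, v in enumerate(x):
            delta = v - self.mean[y][i]
            self.mean[y][i] += delta / self.n[y]
            self.m2[y][i] += delta * (v - self.mean[y][i])

    def predict(self, x):
        total = sum(self.n.values())
        best_label, best_logp = None, -math.inf
        for y, n in self.n.items():
            logp = math.log(n / total)  # class prior
            for i, v in enumerate(x):
                var = self.m2[y][i] / n + 1e-6  # smoothing constant (assumption)
                logp += -0.5 * math.log(2 * math.pi * var)
                logp -= (v - self.mean[y][i]) ** 2 / (2 * var)
            if logp > best_logp:
                best_label, best_logp = y, logp
        return best_label


def classify(x, clusters, nb):
    """Return (label, source): the covering cluster's label, else the Bayes prediction."""
    for c in clusters:
        if c.covers(x):
            return c.label, "cluster"
    return nb.predict(x), "bayes"


clusters = [Cluster((0.0, 0.0), 1.0, "a"), Cluster((5.0, 5.0), 1.0, "b")]
nb = IncrementalGaussianNB()
for point, label in [((0.0, 0.0), "a"), ((0.5, 0.2), "a"),
                     ((5.0, 5.0), "b"), ((4.8, 5.1), "b")]:
    nb.update(point, label)

print(classify((0.2, 0.1), clusters, nb))  # inside the first cluster
print(classify((4.0, 4.0), clusters, nb))  # outside both clusters, falls back to Bayes
```

Checking cluster coverage first keeps the common case cheap, since only instances that fall outside every cluster incur the cost of the probabilistic model.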
