Two-Stage Adaptive Ensemble Learning Method for Different Types of Concept Drift

Guo Husheng (郭虎升); Zhang Yang (张洋); Wang Wenjian (王文剑)

doi:10.7544/issn1000-1239.202330452

Abstract

In the era of big data, streaming data is emerging in large volumes. Concept drift, the most typical and difficult problem in streaming data mining, has received increasing attention. Ensemble learning is a common approach to handling concept drift in streaming data. However, after a drift occurs, learning models often cannot respond promptly to the change in the data distribution and cannot handle different types of concept drift effectively, which degrades their generalization performance. To address this problem, this paper proposes a two-stage adaptive ensemble learning method for different types of concept drift (TAEL). The method first determines the type of concept drift by detecting the drift span. Then, based on the drift type, a "filtering-expansion" two-stage sample processing mechanism dynamically selects an appropriate sample processing strategy. Specifically, in the filtering stage, a different non-key sample filter is created for each drift type to extract the key samples from the historical sample blocks, which brings the historical data distribution closer to the latest data distribution and improves the effectiveness of the base learners. In the expansion stage, a block-priority sampling method is proposed. It sets a sampling scale suited to the drift type and assigns each historical key sample a sampling priority according to the share of the current sample block taken up by the class to which that sample belongs. The sampling priority then determines the sampling probability, and a subset of key samples is drawn from the historical key sample blocks according to that probability to expand the current sample block. This alleviates class imbalance after sample expansion, resolves the underfitting of the current base learner, and enhances its stability. Experimental results show that the method responds promptly to different types of concept drift, accelerates the convergence of the online ensemble model after a drift occurs, and improves the overall generalization performance of the model.
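
The expansion stage described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything specific here is an assumption: the priority of a class is taken as one minus its share of the current sample block (so classes that are rare in the current block are favored), the sampling probability is the normalized priority, and samples are drawn without replacement. The names `class_priorities` and `priority_sample` are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def class_priorities(current_labels, classes):
    """Assumed rule: a class that is rarer in the current block gets a higher priority."""
    n = len(current_labels)
    if n == 0:
        # No information about the current block: treat all classes equally.
        return {c: 1.0 for c in classes}
    counts = Counter(current_labels)
    eps = 1e-9  # keeps every weight positive, even for a class that fills the block
    return {c: 1.0 - counts.get(c, 0) / n + eps for c in classes}


def priority_sample(history, current_labels, k, seed=None):
    """Draw up to k distinct historical key samples, weighted by class priority.

    history: list of (features, label) pairs from the historical key sample blocks.
    current_labels: labels of the samples in the current block.
    k: sampling scale (in the paper it depends on the drift type; here it is an input).
    """
    rng = random.Random(seed)
    prio = class_priorities(current_labels, {label for _, label in history})
    pool = list(history)
    weights = [prio[label] for _, label in pool]
    chosen = []
    for _ in range(min(k, len(pool))):
        # random.choices normalizes the weights, so they act as sampling probabilities.
        i = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(i))  # remove the pick so it cannot be drawn twice
        weights.pop(i)
    return chosen


if __name__ == "__main__":
    history = [((i,), "a") for i in range(10)] + [((i,), "b") for i in range(10, 20)]
    current = ["a"] * 9 + ["b"]  # class "b" is the minority in the current block
    extra = priority_sample(history, current, k=5, seed=0)
    print(Counter(label for _, label in extra))
```

Under these assumptions, the drawn subset tends to contain more samples of the class that is under-represented in the current block, which is the effect the abstract attributes to priority-based expansion. Any other monotone mapping from class share to priority would fit the same description.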
