Abstract:
In the era of big data, massive volumes of streaming data are continuously generated. Concept drift, the most typical and difficult problem in streaming data mining, has received increasing attention. Ensemble learning is a common approach to handling concept drift in streaming data. However, after drift occurs, learning models often fail to respond promptly to changes in the data distribution and cannot effectively handle different types of concept drift, leading to a decline in generalization performance. To address this problem, we propose a two-stage adaptive ensemble learning method for different types of concept drift (TAEL). First, the concept drift type is determined by detecting the drift span. Then, based on the drift type, a “filtering-expansion” two-stage sample processing mechanism dynamically selects an appropriate sample processing strategy. In the filtering stage, different non-critical-sample filters are created for different drift types to extract key samples from historical sample blocks, bringing the historical data distribution closer to the latest distribution and improving the effectiveness of the base learners. In the expansion stage, a block-priority sampling method is proposed: it sets an appropriate sampling scale for the drift type and assigns each historical key sample a sampling priority according to the proportion of its class within the current sample block. Sampling probabilities are then derived from these priorities, and a subset of key samples is drawn from the historical key sample blocks to expand the current sample block. This alleviates class imbalance after expansion, addresses underfitting of the current base learner, and enhances its stability. Experimental results show that the proposed method responds promptly to different types of concept drift, accelerates the convergence of online ensemble models after drift occurs, and improves the overall generalization performance of the model.
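To make the expansion stage more concrete, the following Python sketch illustrates one possible reading of the block-priority sampling step: each historical key sample receives a priority based on its class's share of the current block (here assumed to be inversely proportional, so under-represented classes are favored, consistent with the goal of alleviating class imbalance), priorities are normalized into sampling probabilities, and a subset of the chosen sampling scale is drawn. The function name, the exact priority formula, and the per-drift-type sampling scale are hypothetical; the abstract does not specify them.

```python
import numpy as np

def block_priority_sample(hist_X, hist_y, cur_y, sampling_scale, rng=None):
    """Hypothetical sketch of the block-priority sampling (expansion) step.

    hist_X, hist_y : historical key samples and their labels (NumPy arrays)
    cur_y          : labels of the current sample block
    sampling_scale : number of key samples to draw (set per drift type)
    """
    rng = rng or np.random.default_rng()

    # Class proportions within the current sample block.
    classes, counts = np.unique(cur_y, return_counts=True)
    prop = dict(zip(classes, counts / counts.sum()))

    # Assumed priority: inversely related to the class's share of the
    # current block, so classes that are scarce in the current block
    # are sampled with higher priority.
    eps = 1e-6
    priority = np.array([1.0 / (prop.get(y, eps) + eps) for y in hist_y])

    # Sampling probabilities derived from the priorities.
    prob = priority / priority.sum()

    # Draw the expansion subset from the historical key samples.
    n = min(sampling_scale, len(hist_y))
    idx = rng.choice(len(hist_y), size=n, replace=False, p=prob)
    return hist_X[idx], hist_y[idx]
```

In this reading, the caller would concatenate the returned subset with the current sample block before updating the affected base learner, which is how the expansion stage would counteract underfitting on the new concept.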