Real AdaBoost Algorithm for Multi-Class and Imbalanced Classification Problems
Abstract: Existing AdaBoost algorithms generally do not take the prior distribution of the classes into account. To address this problem, a real AdaBoost algorithm for imbalanced classification problems is derived by minimizing the training error rate: the extremum problem of the training error expressed through the sign function is transformed into an extremum problem of an exponential function, and an approximate error estimate for the resulting algorithm is given. Using the same technique, a new explanation and proof of the validity of the real AdaBoost algorithm for two-class classification is obtained, differing from the existing interpretation, and the approach is extended to multi-class classification, yielding a real AdaBoost algorithm whose procedure and formulas closely parallel the two-class version. The multi-class algorithm is shown to be equivalent to Bayesian statistical inference, and its training error rate is proved to decrease as the number of trained classifiers increases. Theoretical analysis and experiments on UCI datasets demonstrate the effectiveness of the proposed imbalanced multi-class algorithm. In real AdaBoost, imbalanced classification problems are usually converted into balanced ones by adjusting the weights of the training samples; however, when the prior distribution is highly imbalanced, the proposed real AdaBoost algorithm for imbalanced classification is more effective than the existing real AdaBoost algorithms.
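Since the abstract only summarizes the method, the following is a minimal sketch of binary real AdaBoost in which the sample weights are initialized from the class priors, illustrating the exponential surrogate $\mathbb{1}[\operatorname{sign}(F(x_i)) \neq y_i] \le e^{-y_i F(x_i)}$ that the abstract refers to. The inverse-frequency initialization and the use of decision stumps are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of real AdaBoost with class-prior-aware initial sample weights.
# The inverse-frequency weighting below is an assumption for illustration,
# not necessarily the weighting derived in the paper.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def real_adaboost_fit(X, y, n_rounds=50, eps=1e-6):
    """Train a binary real AdaBoost ensemble; y must take values in {-1, +1}."""
    n = len(y)
    # Weight each class inversely to its frequency so a minority class is not
    # ignored when the prior distribution is highly imbalanced (assumption).
    class_weight = {c: n / (2.0 * np.sum(y == c)) for c in (-1, 1)}
    w = np.array([class_weight[yi] for yi in y], dtype=float)
    w /= w.sum()

    stumps = []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        # Weighted class-probability estimate p(y = +1 | x) from the weak learner.
        p_pos = stump.predict_proba(X)[:, list(stump.classes_).index(1)]
        p_pos = np.clip(p_pos, eps, 1.0 - eps)
        # Real-valued weak hypothesis f_m(x) = 0.5 * ln(p / (1 - p)).
        f = 0.5 * np.log(p_pos / (1.0 - p_pos))
        stumps.append(stump)
        # Re-weight: samples with negative margin y * f(x) gain weight exponentially.
        w *= np.exp(-y * f)
        w /= w.sum()
    return stumps

def real_adaboost_predict(stumps, X, eps=1e-6):
    """Predict labels in {-1, +1} as the sign of the summed real-valued scores."""
    F = np.zeros(len(X))
    for stump in stumps:
        p_pos = stump.predict_proba(X)[:, list(stump.classes_).index(1)]
        p_pos = np.clip(p_pos, eps, 1.0 - eps)
        F += 0.5 * np.log(p_pos / (1.0 - p_pos))
    return np.where(F >= 0, 1, -1)
```

The multi-class extension described in the abstract follows the same flow; only the weak-learner output and the weight-update formula change accordingly.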