A Method for Balancing Skewed Training Sets in Data Mining
Abstract: Classification is one of the important tasks in data mining. The training sets extracted for training classifiers are often skewed, and traditional classification algorithms applied to such sets usually yield low predictive accuracy on the minority classes. Existing balancing algorithms deal only with data sets that contain two target classes. To balance training sets with several target classes, an algorithm called SSGP is introduced, based on the idea that cases of the same class differ little from one another. SSGP forms new minority-class cases by interpolating between minority-class cases that lie close together, and it grows the number of cases in every minority class at the same rate. It is proved that SSGP does not add noise to the data set. To improve efficiency, SSGP uses case moduli in place of a large number of pairwise dissimilarity computations. Experiments show that a single run of SSGP improves the predictive accuracy of several minority classes at once.
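The abstract does not spell out SSGP's procedure, so the short Python sketch below only illustrates the general idea it describes: new minority-class cases are interpolated between nearby cases of the same class, and every minority class is grown toward the majority-class size at the same rate. The function name oversample_minorities, the k-nearest-neighbor selection, and the use of Euclidean distance in place of the paper's modulus shortcut are illustrative assumptions, not the published algorithm.

import numpy as np

def oversample_minorities(X, y, k=3, rng=None):
    """Grow every minority class to the size of the largest class by
    interpolating between nearby cases of the same class (a sketch,
    not the SSGP algorithm itself)."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()                        # size of the majority class
    new_X, new_y = [X], [y]
    for cls, count in zip(classes, counts):
        members = X[y == cls]
        for _ in range(target - count):          # every minority class grows at the same rate
            i = rng.integers(len(members))
            # nearby same-class cases; Euclidean distance stands in for the
            # paper's dissimilarity / case-modulus shortcut (an assumption)
            dists = np.linalg.norm(members - members[i], axis=1)
            neighbors = np.argsort(dists)[1:k + 1]
            j = rng.choice(neighbors) if len(neighbors) > 0 else i
            # the synthetic case lies between two same-class cases, so under
            # the "same-class cases differ little" premise it is not noise
            lam = rng.random()
            new_X.append((members[i] + lam * (members[j] - members[i]))[None, :])
            new_y.append(np.array([cls]))
    return np.vstack(new_X), np.concatenate(new_y)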