
Under-Sampling Method Based on Sample Weight for Imbalanced Data

Xiong Bingyan, Wang Guoyin, Deng Weibin

Citation: Xiong Bingyan, Wang Guoyin, Deng Weibin. Under-Sampling Method Based on Sample Weight for Imbalanced Data[J]. Journal of Computer Research and Development, 2016, 53(11): 2613-2622. DOI: 10.7544/issn1000-1239.2016.20150593. CSTR: 32373.14.issn1000-1239.2016.20150593

Funding: This work was supported by the National Natural Science Foundation of China (61272060), the Social Science Foundation of the Chinese Education Commission (15XJA630003), the Scientific and Technological Research Program of Chongqing Municipal Education Commission (KJ1500416), and the Key Natural Science Foundation of Chongqing (CSTC2013jjB40003).
Details
  • Chinese Library Classification (CLC) number: TP391

  • Abstract: Imbalanced data is widespread in the real world, and its classification is an active research topic in data mining and machine learning. Under-sampling is a common way to handle imbalanced data sets: it selects a subset of the majority class so that the class distribution becomes balanced. However, it tends to discard useful information in the majority class. To address this problem, an under-sampling method based on sample weight, named KAcBag (K-means AdaCost bagging), is proposed. The method introduces a sample weight that reflects the region in which a sample lies. First, each sample's weight is initialized according to the number of samples in each class and is then modified through repeated clustering; majority class samples with small weights lie in the central region of the majority class. Next, the majority class is under-sampled according to these weights, so that samples in the central region are more likely to be drawn. The sampled majority class samples are combined with all minority class samples to form the training data of each bagging member classifier, yielding several decision tree sub-classifiers. Finally, the prediction model is built by weighted voting according to the accuracy of each sub-classifier. Experiments on 19 UCI data sets and on handset-replacement data from customers of a telecom operator show that the samples selected by KAcBag are highly representative, and that the method effectively improves classification performance on the minority class while reducing the scale of the problem.
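The abstract describes the pipeline only at a high level, so the following Python sketch is an illustration under stated assumptions, not the authors' implementation. It covers the weighted under-sampling and accuracy-weighted voting steps. Everything else is a stand-in: the weights come from a hypothetical helper `centroid_distance_weights` (distance to the majority centroid) rather than the paper's class-size initialization and clustering-based updates, the selection probability is assumed to be proportional to 1 / weight, each member is a depth-1 decision stump rather than a full decision tree, and member accuracy is assumed to be measured on the full training set. All function names are hypothetical.

```python
import random


def centroid_distance_weights(majority):
    """Stand-in weights (assumption): distance to the majority centroid.

    The paper derives weights from class sizes and repeated clustering;
    this helper only mimics the property that central samples get small weights.
    """
    dim = len(majority[0])
    n = len(majority)
    centroid = [sum(x[d] for x in majority) / n for d in range(dim)]
    return [
        0.1 + sum((x[d] - centroid[d]) ** 2 for d in range(dim)) ** 0.5
        for x in majority
    ]


def weighted_undersample(majority, weights, k, rng):
    """Draw k majority samples without replacement, favouring small weights.

    Assumption: selection probability p is proportional to 1 / weight.
    Efraimidis-Spirakis sampling: key = u ** (1 / p) = u ** weight,
    then keep the k samples with the largest keys.
    """
    keyed = [(rng.random() ** w, x) for x, w in zip(majority, weights)]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in keyed[:k]]


def predict_stump(stump, x):
    """Predict 1 (minority) if the chosen feature lies on the stump's positive side."""
    feature, threshold, sign = stump
    return 1 if sign * (x[feature] - threshold) > 0 else 0


def train_stump(data):
    """Fit a depth-1 decision tree on (sample, label) pairs by exhaustive search."""
    best_correct, best_stump = -1, None
    for feature in range(len(data[0][0])):
        for threshold in sorted({x[feature] for x, _ in data}):
            for sign in (1, -1):
                stump = (feature, threshold, sign)
                correct = sum(predict_stump(stump, x) == y for x, y in data)
                if correct > best_correct:
                    best_correct, best_stump = correct, stump
    return best_stump


def fit_ensemble(majority, weights, minority, n_members=5, seed=0):
    """Train members on balanced subsets; store each with its accuracy."""
    rng = random.Random(seed)
    full = [(x, 0) for x in majority] + [(x, 1) for x in minority]
    ensemble = []
    for _ in range(n_members):
        drawn = weighted_undersample(majority, weights, len(minority), rng)
        train = [(x, 0) for x in drawn] + [(x, 1) for x in minority]
        stump = train_stump(train)
        # Assumption: accuracy is measured on the full (imbalanced) training set.
        accuracy = sum(predict_stump(stump, x) == y for x, y in full) / len(full)
        ensemble.append((accuracy, stump))
    return ensemble


def predict(ensemble, x):
    """Accuracy-weighted vote over all member classifiers."""
    votes = {0: 0.0, 1: 0.0}
    for accuracy, stump in ensemble:
        votes[predict_stump(stump, x)] += accuracy
    return 1 if votes[1] > votes[0] else 0


# Toy data: label 0 is the majority class, label 1 the minority class.
majority = [(0.0,), (0.2,), (-0.2,), (0.1,), (-0.1,), (3.0,), (-3.0,), (0.05,)]
minority = [(5.0,), (5.5,), (6.0,)]
weights = centroid_distance_weights(majority)
ensemble = fit_ensemble(majority, weights, minority, n_members=5, seed=0)
print(predict(ensemble, (5.8,)), predict(ensemble, (-1.0,)))
```

Because each member sees only as many majority samples as there are minority samples, every training subset is balanced, and the weighting makes central majority samples appear in more subsets than outlying ones such as `(3.0,)` and `(-3.0,)`.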
Publication history
  • Published: 2016-10-31
