Under-Sampling Method Based on Sample Weight for Imbalanced Data

Xiong Bingyan; Wang Guoyin; Deng Weibin

doi:10.7544/issn1000-1239.2016.20150593

Journal of Computer Research and Development, 2016, 53(11): 2613-2622. DOI: 10.7544/issn1000-1239.2016.20150593. CSTR: 32373.14.issn1000-1239.2016.20150593

Citation: Xiong Bingyan, Wang Guoyin, Deng Weibin. Under-Sampling Method Based on Sample Weight for Imbalanced Data[J]. Journal of Computer Research and Development, 2016, 53(11): 2613-2622. DOI: 10.7544/issn1000-1239.2016.20150593


(Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065)


Published Date: October 31, 2016


Abstract


Imbalanced data is widespread in the real world, and its classification is an active topic in data mining and machine learning. Under-sampling is a widely used method for dealing with imbalanced data sets; its main idea is to choose a subset of the majority class so that the data set becomes balanced. However, some useful majority-class information may be lost. To address this problem, an under-sampling method based on sample weight, named KAcBag (K-means AdaCost bagging), is proposed. In this method, a sample weight is introduced to reveal the region in which each sample is located. First, a weight is assigned to each sample according to the sample scale, and the weights are modified after clustering the data set. Samples with smaller weights lie in the center of the majority class. Then, samples are drawn from the majority class in accordance with their weights, so that samples in the center of the majority class are easily selected. The sampled majority-class samples and all minority-class samples are combined as the training set for a component classifier. In this way, several decision tree sub-classifiers are obtained. Finally, the prediction model is constructed based on the accuracy of each sub-classifier. Experiments on nineteen UCI data sets and telecom user data show that KAcBag makes the selected samples more representative. As a result, the method can improve classification performance on the minority class and reduce the scale of the problem.
Keywords: imbalanced data, under-sampling, sample weight, clustering, ensemble learning
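
The sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' KAcBag implementation: the abstract does not give the weight formula, so the hypothetical `selection_scores` function below simply favors majority samples close to the majority centroid, standing in for the clustering-based sample weight. The function names, the label encoding (0 for majority, 1 for minority), and the toy data are all assumptions. Training the decision tree sub-classifiers and combining them by accuracy are left out.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def selection_scores(majority):
    """Hypothetical score: larger for samples closer to the majority centroid.

    Assumption: this stands in for the paper's clustering-based sample
    weight, whose exact formula is not stated in the abstract.
    """
    c = centroid(majority)
    return [1.0 / (1.0 + math.dist(p, c)) for p in majority]


def weighted_sample_without_replacement(items, scores, k, rng):
    """Efraimidis-Spirakis sampling: key = u ** (1 / score), keep the k largest keys."""
    keyed = [(rng.random() ** (1.0 / s), i) for i, s in enumerate(scores)]
    keyed.sort(reverse=True)
    return [items[i] for _, i in keyed[:k]]


def balanced_bags(majority, minority, n_bags, seed=0):
    """Build n_bags training sets: a score-weighted majority subset plus all minority samples."""
    rng = random.Random(seed)
    scores = selection_scores(majority)
    k = min(len(minority), len(majority))
    bags = []
    for _ in range(n_bags):
        subset = weighted_sample_without_replacement(majority, scores, k, rng)
        # Label 0 marks majority samples, label 1 marks minority samples.
        bags.append([(x, 0) for x in subset] + [(x, 1) for x in minority])
    return bags


# Toy data (made up for illustration): 100 majority points on a grid, 3 minority points.
majority = [(float(i % 10), float(i // 10)) for i in range(100)]
minority = [(20.0, 20.0), (21.0, 19.0), (19.5, 20.5)]
bags = balanced_bags(majority, minority, n_bags=5)
```

Each bag would then serve as the training set for one component classifier. Sampling without replacement within a bag keeps the selected majority samples distinct, while different bags can still overlap.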


Cited By


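Citing periodical articles (2)

1.	Liu Mengjun, Jiang Xinyu, Shi Sijin, Jiang Nan, Wu Di. Security Warnings for the Integration of Artificial Intelligence and Education: An Analysis of Native Risks Arising from Machine Learning Algorithm Functions. Journal of Jiangnan University (Humanities & Social Sciences Edition). 2022(05): 89-101.
2.	Liu Botao, Peng Changgen, Wu Ruixue, Ding Hongfa, Xie Mingming. Research on a Lightweight Format-Preserving Encryption Algorithm for Numeric Data. Journal of Computer Research and Development. 2019(07): 1488-1497.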