ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2021, Vol. 58 ›› Issue (3): 539-547. doi: 10.7544/issn1000-1239.2021.20200324

• Artificial Intelligence •

Credit Fraud Detection for Extremely Imbalanced Data Based on Ensembled Deep Learning

Liu Ying1, Yang Ke2

  1. School of Management Science and Information Engineering, Jilin University of Finance and Economics, Changchun 130117
  2. School of Taxation, Jilin University of Finance and Economics, Changchun 130117
  (lyaihua1995@163.com)
  • Published: 2021-03-01
  • Online: 2021-03-01
  • Supported by: 
    This work was supported by the National Social Science Foundation of China (20BTJ062).

Abstract: Class imbalance in credit fraud data significantly undermines model performance. When the class distribution is extremely imbalanced, noise arising from information distortion, periodic statistical error, and reporting bias interferes strongly with model training and readily leads to overfitting. To address this, the paper proposes an ensemble algorithm built on a deep belief network (DBN) for credit fraud data with extreme class imbalance. First, a bidirectional joint sampling strategy that combines under-sampling and over-sampling draws the training subsets, mitigating both information loss and overfitting. Second, base-classifier clusters are constructed in two stages: because the separating hyperplane of a support vector machine (SVM) shifts toward the minority class on imbalanced data, a boosting algorithm is used to generate clusters that combine SVM and random forest (RF) classifiers. Finally, a DBN integrates the predictions of the base-classifier clusters and outputs the final classification. Because traditional accuracy metrics focus too heavily on the majority class and ignore the fact that, in credit fraud, the loss from a default exceeds the interest income, a revenue cost index (a cost-benefit index) that accounts for the recognition of both positive and negative samples is introduced to improve prediction on the minority class. In experiments on European credit card fraud data, the proposed algorithm raises the mean revenue cost index by 3 percentage points relative to other related algorithms. The experiments also compare the effect of the imbalance ratio on accuracy, and the results indicate that the proposed algorithm performs better on extremely imbalanced data.

Key words: credit fraud, extremely imbalanced data, deep belief network (DBN), support vector machine (SVM), revenue cost index

CLC number:
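
The joint sampling step named in the abstract, under-sampling the majority class while over-sampling the minority class to draw a training subset, can be illustrated with a minimal sketch. This is not the paper's algorithm, whose details are not given on this page: it assumes plain random sampling in both directions, labels 0 (legitimate) and 1 (fraud), and a hypothetical helper `joint_sample` with hypothetical parameters `n_major` and `n_minor`.

```python
import numpy as np


def joint_sample(X, y, n_major, n_minor, seed=0):
    """Draw one training subset (hypothetical helper, not from the paper).

    Assumes label 0 is the majority class and label 1 the minority class.
    """
    rng = np.random.default_rng(seed)
    major_idx = np.flatnonzero(y == 0)
    minor_idx = np.flatnonzero(y == 1)
    # Under-sampling: keep a random subset of the majority class, no repeats.
    keep_major = rng.choice(
        major_idx, size=min(n_major, major_idx.size), replace=False
    )
    # Over-sampling: draw minority samples with replacement up to n_minor.
    keep_minor = rng.choice(minor_idx, size=n_minor, replace=True)
    # Shuffle so the two classes are interleaved in the subset.
    idx = rng.permutation(np.concatenate([keep_major, keep_minor]))
    return X[idx], y[idx]


# Toy data (synthetic, for illustration only): 1000 legitimate, 5 fraudulent.
toy_rng = np.random.default_rng(42)
X = toy_rng.normal(size=(1005, 4))
y = np.concatenate([np.zeros(1000, dtype=int), np.ones(5, dtype=int)])

X_sub, y_sub = joint_sample(X, y, n_major=200, n_minor=50)
print(int((y_sub == 0).sum()), int((y_sub == 1).sum()))  # 200 50
```

Repeating the call with different `seed` values would give the distinct training subsets that an ensemble of base classifiers needs. Random over-sampling with replacement only duplicates existing fraud records, so it is the simplest stand-in for whatever over-sampling scheme the paper actually uses.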