ISSN 1000-1239 CN 11-1777/TP


VOTCL and the Study of Its Application on Cross-Selling Problems

Zhou Guangtong, Yin Yilong, Guo Xinjian, and Dong Cailing

  1. (School of Computer Science and Technology, Shandong University, Jinan 250101) (zhouguangtong@gmail.com)
  • Published: 2010-09-15
  • Online: 2010-09-15

Abstract: Cross-selling is regarded as one of the most promising strategies for increasing profit. The authors first describe a typical cross-selling model and then show by analysis that class imbalance and cost sensitivity usually coexist in data sets collected from this domain. The central task in real-world cross-selling applications is identifying potential cross-selling customers, and prediction performance suffers when class imbalance and cost sensitivity arise simultaneously. To address this problem, a voting method based on an optimal threshold, called VOTCL, is proposed. In the first stage, VOTCL generates a number of class-balanced training data sets by combining under-sampling and over-sampling; in the second stage, a base learner is trained on each balanced data set; finally, VOTCL combines the base learners into the final decision model with an optimal-threshold-based voting scheme. On the cross-selling data set provided by the PAKDD 2007 data mining competition, VOTCL achieves an AUC of 0.6037. The ensemble also outperforms a single base learner, which to some extent demonstrates the effectiveness of the proposed optimal-threshold-based voting scheme.

Key words: cross-selling, class imbalance, cost-sensitive, optimal-threshold-based voting, support vector machine
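
The abstract names the three stages of VOTCL but does not give their details, so the following Python sketch is only an illustration of the general idea, not the authors' implementation. It rests on several assumptions: a nearest-centroid classifier stands in for the support vector machine base learner named in the key words; the voting threshold is taken to be the fraction of positive votes that minimizes a cost-weighted error on held-out data; and all function names, the synthetic data, and the cost values (`cost_fn`, `cost_fp`) are hypothetical.

```python
import random

def balanced_sets(pos, neg, n_sets, size, rng):
    """Stage 1 (assumed form): over-sample the minority class with
    replacement and under-sample the majority class without replacement."""
    sets = []
    for _ in range(n_sets):
        p = [rng.choice(pos) for _ in range(size)]  # over-sampling
        n = rng.sample(neg, size)                   # under-sampling
        sets.append((p, n))
    return sets

def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def train_centroid_learner(p, n):
    """Stage 2 stand-in: nearest-centroid classifier (NOT an SVM)."""
    cp, cn = centroid(p), centroid(n)
    def predict(x):
        dp = sum((a - b) ** 2 for a, b in zip(x, cp))
        dn = sum((a - b) ** 2 for a, b in zip(x, cn))
        return 1 if dp < dn else 0
    return predict

def vote_fraction(learners, x):
    """Fraction of base learners that vote 'positive' for x."""
    return sum(f(x) for f in learners) / len(learners)

def best_threshold(learners, X_val, y_val, cost_fn=5.0, cost_fp=1.0):
    """Stage 3 (assumed criterion): choose the vote-fraction threshold
    with the lowest cost-weighted error on validation data."""
    fractions = [vote_fraction(learners, x) for x in X_val]
    candidates = [k / len(learners) for k in range(1, len(learners) + 1)]
    def total_cost(t):
        cost = 0.0
        for f, y in zip(fractions, y_val):
            pred = 1 if f >= t else 0
            if pred == 0 and y == 1:
                cost += cost_fn   # missed cross-selling customer
            elif pred == 1 and y == 0:
                cost += cost_fp   # wasted offer
        return cost
    return min(candidates, key=total_cost)

def make_points(n, center, rng):
    """Synthetic 2-D points around (center, center); illustration only."""
    return [[rng.gauss(center, 1.0), rng.gauss(center, 1.0)] for _ in range(n)]

rng = random.Random(0)
pos, neg = make_points(30, 2.0, rng), make_points(600, 0.0, rng)
sets = balanced_sets(pos, neg, n_sets=7, size=60, rng=rng)
learners = [train_centroid_learner(p, n) for p, n in sets]

X_val = make_points(20, 2.0, rng) + make_points(400, 0.0, rng)
y_val = [1] * 20 + [0] * 400
t = best_threshold(learners, X_val, y_val)

def predict(x):
    """Final decision: positive if enough base learners agree."""
    return 1 if vote_fraction(learners, x) >= t else 0

print("chosen vote threshold:", t)
```

Raising `cost_fn` relative to `cost_fp` in this sketch tends to push the selected threshold down, so that fewer agreeing learners are needed to flag a customer; this is one simple way a voting threshold can reflect cost sensitivity.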