
    Customer Churn Prediction Based on Generation Data Reconstruction Using Diffusion Model

    Abstract: In data mining, data imbalance commonly degrades model prediction accuracy, and user privacy protection is often left unaddressed. Generating synthetic data is an important remedy for both problems. In settings dominated by structured data, however, the features are high-dimensional and largely unrelated to one another, which makes it difficult to generate high-quality data. Motivated by the success of diffusion models in tasks such as image generation, we apply them to customer churn prediction, a typical data mining scenario. We use a Gaussian diffusion model for the numerical features and a multinomial diffusion model for the categorical features of customer churn data, and we analyze both the predictive performance obtained with the generated data and its capacity for data privacy protection. Extensive experiments on customer churn datasets from multiple domains explore the feasibility of fusing synthetic data with real data for data reconstruction. The results show that the diffusion model can generate high-quality data and yields some improvement across a variety of prediction methods, which helps alleviate data imbalance. In addition, the distribution of the data generated by the diffusion model closely matches that of the real data, which suggests potential value for user privacy protection.
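    The abstract pairs Gaussian diffusion with numerical features and multinomial diffusion with categorical features. The sketch below illustrates only the standard forward (noising) step of each process for a single record; it is not the authors' implementation, and the learned reverse (denoising) model that actually generates data is not shown. The linear noise schedule, the function names, and the feature names `monthly_charges` and `contract_type` are all assumptions made for illustration.

```python
import math
import random


def linear_alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t under an assumed linear beta schedule."""
    alpha_bar = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / max(num_steps - 1, 1)
        alpha_bar *= 1.0 - beta
    return alpha_bar


def noise_numerical(x0, alpha_bar, rng):
    """Gaussian forward step: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps


def noise_categorical(category, num_categories, alpha_bar, rng):
    """Multinomial forward step: sample x_t from
    Cat(alpha_bar * onehot(x0) + (1 - alpha_bar) / K)."""
    probs = [(1.0 - alpha_bar) / num_categories] * num_categories
    probs[category] += alpha_bar
    sampled = rng.choices(range(num_categories), weights=probs, k=1)[0]
    return sampled, probs


# Hypothetical standardized customer record (feature names are assumptions).
rng = random.Random(0)
row = {"monthly_charges": 0.8, "contract_type": 2}
a_bar = linear_alpha_bar(t=500, num_steps=1000)
noisy_charge = noise_numerical(row["monthly_charges"], a_bar, rng)
noisy_contract, contract_probs = noise_categorical(row["contract_type"], 3, a_bar, rng)
```

    As the step index grows, `alpha_bar` shrinks toward zero, so a numerical value drifts toward standard Gaussian noise and a categorical value drifts toward a uniform draw over its categories. A generative model for mixed-type churn records would be trained to reverse both corruptions.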

       
