Abstract:
In data mining, data imbalance commonly degrades model prediction accuracy, while the protection of user privacy is often neglected. Synthetic data generation has emerged as an important remedy for both problems. However, structured data are often high-dimensional and contain irrelevant features, which makes it difficult to generate high-quality samples in settings where such data predominate. Motivated by the success of diffusion models in image generation, we apply diffusion models to customer churn prediction, a typical data mining scenario. We use a Gaussian diffusion model and a multinomial diffusion model to generate the numerical and categorical features of customer churn data, respectively. We analyze both the predictive performance and the privacy protection capability of our model. We conduct extensive experiments on customer churn data from multiple domains to explore the potential of combining synthetic and real datasets for data reconstruction. The results demonstrate that the diffusion model can generate high-quality data and improve the performance of various prediction methods, which helps alleviate data imbalance. In addition, the data produced by the diffusion model follow a distribution quite similar to that of the original dataset, which may be useful for protecting user privacy.
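
As an illustration of how the two kinds of features can be noised, the following is a minimal sketch of the forward (noising) process only, not the implementation evaluated in this work. It assumes a linear noise schedule, a standard Gaussian forward step for a numerical feature, and a multinomial (categorical) forward step that mixes the one-hot value with a uniform distribution over the categories. The function names, the schedule parameters, and the example record (a standardized monthly charge and a three-level contract type) are hypothetical.

```python
import math
import random


def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t under an assumed linear schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / max(num_steps - 1, 1)
        prod *= 1.0 - beta
    return prod


def noise_numerical(x0, a_bar, rng):
    """Gaussian forward step: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * eps


def categorical_probs(x0_index, num_classes, a_bar):
    """Multinomial forward marginal: a_bar * onehot(x0) + (1 - a_bar) / K."""
    uniform_mass = (1.0 - a_bar) / num_classes
    return [
        a_bar * (1.0 if k == x0_index else 0.0) + uniform_mass
        for k in range(num_classes)
    ]


def noise_categorical(x0_index, num_classes, a_bar, rng):
    """Sample a noised category index from the multinomial forward marginal."""
    probs = categorical_probs(x0_index, num_classes, a_bar)
    return rng.choices(range(num_classes), weights=probs, k=1)[0]


# Hypothetical churn record: standardized monthly charge and a 3-level contract type.
rng = random.Random(0)
a_bar_mid = alpha_bar(t=500, num_steps=1000)
noisy_charge = noise_numerical(0.8, a_bar_mid, rng)
noisy_contract = noise_categorical(1, 3, a_bar_mid, rng)
```

As the step index grows, the numerical feature approaches standard Gaussian noise and the categorical feature approaches a uniform draw over its categories. A generative model is then trained to reverse this corruption, which is the part this sketch omits.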