A Generative Model for Structured Data Tables Based on Generative Adversarial Networks

宋珂慧; 张莹; 张江伟; 袁晓洁

doi:10.7544/issn1000-1239.2019.20180353

A Generative Model for Synthesizing Structured Datasets Based on GAN

Abstract

Abstract: Synthesizing high-quality datasets is a long-standing and challenging problem in both the machine learning and database communities. One application of synthesized high-quality data is to improve model training, especially for deep learning models. A robust training process requires a large annotated dataset. One way to obtain such a dataset is manual annotation by domain experts, which is expensive and error-prone, so automatically synthesizing high-quality data that resembles the real data is a more practical alternative. Driven by the rapid development of computer vision, a number of studies have addressed the synthesis of image datasets. However, those models cannot be applied directly to structured data tables (tables of numeric and categorical attributes), and, to our knowledge, very little work has addressed this kind of data. We therefore propose TableGAN, a generative model for structured data tables and, to our knowledge, the first from the generative adversarial network (GAN) family for this kind of data. TableGAN uses adversarial training to improve the performance of the generative model. It modifies the internal structure of the traditional GAN, including the optimization function, to suit the characteristics of structured data, so that it can generate high-quality structured samples for improving model training. Extensive experiments on real datasets verify its effectiveness: models trained on the enlarged datasets perform significantly better.
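
For context, the adversarial training mentioned in the abstract is usually formalized by the standard GAN minimax objective, sketched below. The symbols G (generator), D (discriminator), p_data (real data distribution), and p_z (noise prior) come from that standard formulation and are assumptions here, not notation taken from this page. The abstract says TableGAN modifies the optimization function; that modified form is not given here, so the block shows only the unmodified baseline.

```latex
% Standard GAN objective (baseline only; not TableGAN's modified objective).
% D is trained to maximize V, G is trained to minimize it.
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \bigl[ \log D(x) \bigr]
  + \mathbb{E}_{z \sim p_{z}(z)} \bigl[ \log \bigl( 1 - D(G(z)) \bigr) \bigr]
```

In the structured-data setting, x would be one table row of numeric and categorical attributes and G(z) a synthesized row.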
