Clustering Method for Tabular Data Based on Pretrained Foundation Models with Synthetic Data

Li Peiwen; Li Feijiang; Wang Jieting; Qian Yuhua

doi:10.7544/issn1000-1239.202550405

Abstract: With rapid advances in data acquisition and storage technologies, industries have collected and stored large amounts of unlabeled tabular data. Cluster analysis is an important method for uncovering the latent grouping structure of such data. At present, most clustering methods applied to tabular data are still traditional clustering algorithms. Deep learning and large-model techniques are mainly used for unstructured data such as images, text, and speech, and their strong representation and inference capabilities have so far been difficult to exploit for structured tabular data. TabPFN, published in Nature in 2025, is a foundation model for tabular data that efficiently handles classification and regression tasks, providing a new basis for tabular data learning. Inspired by TabPFN, this paper proposes a clustering method for tabular data based on a foundation model pretrained on synthetic data. The method consists of a pretraining phase and an iterative inference phase. The pretraining phase uses a traditional clustering algorithm and the TabPFN model to obtain initial pseudo-labels for the unlabeled tabular data; the iterative inference phase uses the fine-tuned TabPFN model to update the pseudo-labels repeatedly until the clustering result is obtained. Extensive experiments on benchmark datasets show that the proposed method significantly improves the performance of seven representative clustering algorithms.
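The two-phase idea summarized in the abstract (initial pseudo-labels from a traditional clustering algorithm, then repeated relabeling by a classifier) can be illustrated with a minimal sketch. The abstract does not give the procedure in detail, so everything below is an assumption made for illustration: k-means supplies the initial pseudo-labels, a small k-nearest-neighbour classifier stands in for the fine-tuned TabPFN model, labels are updated by two-fold cross-fitting, and the loop stops when the labels no longer change. All function names are hypothetical and are not taken from the paper or from the TabPFN package.

```python
import random
from collections import Counter


def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def kmeans_labels(X, k, iters=20, seed=0):
    """Initial pseudo-labels from plain k-means (assumed initializer)."""
    rng = random.Random(seed)
    centers = [list(x) for x in rng.sample(X, k)]
    labels = [0] * len(X)
    for _ in range(iters):
        # Assign each row to its nearest center.
        labels = [min(range(k), key=lambda c: dist2(x, centers[c])) for x in X]
        # Move each non-empty center to the mean of its members.
        for c in range(k):
            members = [x for x, lab in zip(X, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def knn_predict(train_X, train_y, query_X, k=3):
    """Placeholder classifier (k-NN majority vote); NOT TabPFN."""
    preds = []
    for q in query_X:
        order = sorted(range(len(train_X)), key=lambda i: dist2(q, train_X[i]))
        votes = Counter(train_y[i] for i in order[:k])
        preds.append(votes.most_common(1)[0][0])
    return preds


def refine_pseudo_labels(X, labels, predict=knn_predict, n_folds=2, max_rounds=10):
    """Iteratively relabel each fold using a classifier fit on the other folds."""
    labels = list(labels)
    for _ in range(max_rounds):
        new_labels = list(labels)
        for f in range(n_folds):
            held = list(range(f, len(X), n_folds))  # every n_folds-th row
            held_set = set(held)
            train = [i for i in range(len(X)) if i not in held_set]
            preds = predict(
                [X[i] for i in train],
                [labels[i] for i in train],  # current pseudo-labels as targets
                [X[i] for i in held],
            )
            for i, p in zip(held, preds):
                new_labels[i] = p
        if new_labels == labels:  # assumed stopping rule: labels are stable
            break
        labels = new_labels
    return labels


def cluster(X, k):
    """Phase 1: initial pseudo-labels. Phase 2: iterative refinement."""
    return refine_pseudo_labels(X, kmeans_labels(X, k))
```

In this sketch the `predict` argument is the only place where a tabular foundation model would be plugged in: any callable that takes training rows, their current pseudo-labels, and query rows, and returns predicted labels, can replace `knn_predict`.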
