Abstract:
Rapid advances in data acquisition and storage technologies have led to vast amounts of unlabeled tabular data being collected and stored across industries. Clustering analysis is a fundamental method for uncovering the latent grouping structure in such data. At present, this type of data is still processed predominantly with traditional clustering algorithms. Deep learning and large-model techniques, known for their powerful representation and inference capabilities, are applied primarily to unstructured data such as images, text, and audio; their advantages remain largely underexploited for structured tabular data. In 2025, TabPFN, a foundation model for tabular data that efficiently handles classification and regression tasks, was published in Nature, offering a new perspective on tabular data learning. Building on TabPFN, this paper proposes a clustering method for tabular data based on foundation models pretrained on synthetic data. The method comprises two phases: a pretraining phase and an iterative inference phase. The pretraining phase uses traditional clustering algorithms and the TabPFN model to obtain initial pseudo-labels for the unlabeled tabular data. The iterative inference phase then uses the fine-tuned TabPFN model to update these pseudo-labels repeatedly until the final clustering results are obtained. Extensive experiments on benchmark datasets demonstrate that the proposed method significantly improves the clustering performance of seven representative clustering algorithms.
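The two-phase idea described above (initial pseudo-labels from a traditional clusterer, then repeated relabeling by a supervised model) can be illustrated with a minimal sketch. It is not the paper's implementation: a small k-means stands in for the traditional clustering algorithm, a k-nearest-neighbor vote stands in for TabPFN, and the out-of-fold update rule, the function names, and all parameters are assumptions made purely for illustration.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(rows, k, iters=20):
    """Initial pseudo-labels from plain k-means (stand-in for any traditional clusterer)."""
    # Deterministic farthest-first initialization of the k centers.
    centers = [rows[0]]
    while len(centers) < k:
        centers.append(max(rows, key=lambda r: min(sq_dist(r, c) for c in centers)))
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign every row to its nearest center, then recompute the centers.
        labels = [min(range(k), key=lambda c: sq_dist(r, centers[c])) for r in rows]
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def knn_predict(train_rows, train_labels, query, n_neighbors=5):
    """Hypothetical stand-in classifier; the paper uses TabPFN in this role."""
    order = sorted(range(len(train_rows)), key=lambda i: sq_dist(train_rows[i], query))
    votes = Counter(train_labels[i] for i in order[:n_neighbors])
    return votes.most_common(1)[0][0]


def refine_pseudo_labels(rows, labels, n_folds=5, max_rounds=10):
    """Assumed update rule: relabel each fold with a model fit on the other folds."""
    labels = list(labels)
    for _ in range(max_rounds):
        new_labels = list(labels)
        for fold in range(n_folds):
            train = [i for i in range(len(rows)) if i % n_folds != fold]
            train_rows = [rows[i] for i in train]
            train_labels = [labels[i] for i in train]
            for i in range(fold, len(rows), n_folds):
                new_labels[i] = knn_predict(train_rows, train_labels, rows[i])
        if new_labels == labels:  # pseudo-labels are stable: stop iterating
            break
        labels = new_labels
    return labels


# Toy data: two well-separated 5x5 grids of points (25 rows each).
blob_a = [(0.1 * x, 0.1 * y) for x in range(5) for y in range(5)]
blob_b = [(10 + px, 10 + py) for px, py in blob_a]
rows = blob_a + blob_b

initial = kmeans_labels(rows, k=2)

# Corrupt a few pseudo-labels on purpose to mimic an imperfect initial clustering.
noisy = list(initial)
for i in (0, 7, 30):
    noisy[i] = 1 - noisy[i]

refined = refine_pseudo_labels(rows, noisy)
print("refined labels match the clean clustering:", refined == initial)
```

The out-of-fold scheme is one possible design choice: predicting each row with a model that never saw that row's own pseudo-label keeps the classifier from simply echoing its training labels back, so mislabeled rows can be outvoted by their neighbors.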