    Li Peiwen, Li Feijiang, Wang Jieting, Qian Yuhua. Clustering Method for Tabular Data Based on Pretrained Foundation Models with Synthetic Data[J]. Journal of Computer Research and Development, 2025, 62(9): 2139-2151. DOI: 10.7544/issn1000-1239.202550405

    Clustering Method for Tabular Data Based on Pretrained Foundation Models with Synthetic Data

    Rapid advances in data acquisition and storage have left many industries with vast amounts of unlabeled tabular data. Cluster analysis is a fundamental method for uncovering the latent grouping structure of such data, yet traditional clustering algorithms still dominate in practice. Deep learning and large models, known for their powerful representation and inference capabilities, are applied mainly to unstructured data such as images, text, and audio; their advantages remain largely unexploited for structured tabular data. In 2025, Nature published TabPFN, a foundation model for tabular data that handles classification and regression tasks efficiently and offers a new perspective on tabular data learning. Building on TabPFN, this paper proposes a clustering method for tabular data based on pretrained foundation models with synthetic data. The method has two phases: a pretraining phase and an iterative inference phase. The pretraining phase uses traditional clustering algorithms and the TabPFN model to obtain initial pseudo-labels for the unlabeled tabular data. The iterative inference phase then uses the fine-tuned TabPFN model to update these pseudo-labels cyclically until the final clustering result is obtained. Extensive experiments on benchmark datasets show that the proposed method significantly improves the clustering performance of seven representative clustering algorithms.
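
The two-phase idea (initial pseudo-labels from a traditional clusterer, then repeated relabeling by a classifier) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify how labels are updated, so the fold-based relabeling, the stopping rule, and all function names here (`kmeans_labels`, `knn_predict`, `refine_pseudo_labels`) are assumptions. TabPFN itself is not called; a small k-nearest-neighbour voter stands in for the classifier so the example runs on the standard library alone.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(X, k, iters=20, seed=0):
    """Initial pseudo-labels from plain k-means (assumed stand-in for any traditional clusterer)."""
    rng = random.Random(seed)
    centers = rng.sample(X, k)  # start from k distinct data rows
    labels = [0] * len(X)
    for _ in range(iters):
        # Assign every row to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(x, centers[c])) for x in X]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [x for x, lab in zip(X, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def knn_predict(train_X, train_y, test_X, n_neighbors=3):
    """Hypothetical stand-in classifier; a tabular foundation model would slot in here."""
    preds = []
    for x in test_X:
        order = sorted(range(len(train_X)), key=lambda i: sq_dist(x, train_X[i]))
        votes = Counter(train_y[i] for i in order[:n_neighbors])
        preds.append(votes.most_common(1)[0][0])
    return preds


def refine_pseudo_labels(X, labels, predict=knn_predict, folds=2, max_rounds=10):
    """Relabel each fold from the other folds' pseudo-labels until nothing changes (assumed scheme)."""
    labels = list(labels)
    for _ in range(max_rounds):
        new_labels = list(labels)
        for f in range(folds):
            test_idx = [i for i in range(len(X)) if i % folds == f]
            train_idx = [i for i in range(len(X)) if i % folds != f]
            preds = predict(
                [X[i] for i in train_idx],
                [labels[i] for i in train_idx],
                [X[i] for i in test_idx],
            )
            for i, p in zip(test_idx, preds):
                new_labels[i] = p
        if new_labels == labels:  # pseudo-labels are stable, stop iterating
            break
        labels = new_labels
    return labels


if __name__ == "__main__":
    # Two well-separated toy groups of four rows each.
    X = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
    initial = kmeans_labels(X, k=2)
    print(refine_pseudo_labels(X, initial))
```

Passing the classifier as the `predict` argument keeps the loop independent of the model, so a wrapper around a real tabular classifier with the same `(train_X, train_y, test_X)` shape could replace `knn_predict` without touching the iteration logic.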