ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2018, Vol. 55, Issue (1): 102-112. doi: 10.7544/issn1000-1239.2018.20160662


A New Documents Clustering Method Based on Frequent Itemsets

Zhang Xuesong, Jia Caiyan   

  1. (Beijing Key Lab of Traffic Data Analysis and Mining (Beijing Jiaotong University), Beijing 100044) (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044)
  • Online: 2018-01-01

Abstract: Traditional document clustering methods represent documents with a vector space model (VSM) over words. This representation measures only the importance of individual words, ignores the semantic relationships between words, and is high-dimensional. In this study, we propose a new document clustering method, FIC (frequent itemsets based document clustering). FIC represents each document with frequent itemsets, where a frequent itemset is a set of words that frequently co-occur, mined from the documents with the FP-Growth algorithm. We then construct a document-document relationship network based on the similarity between pairs of documents under this new representation. Finally, we divide the network into communities with a given community detection method to complete the clustering. FIC thus not only overcomes the high dimensionality of VSM but also makes full use of the topological relationships among documents. Experimental results on two English corpora (Reuters-21578 and 20 Newsgroups) and one Chinese corpus (Sougou-News) demonstrate that FIC outperforms the other existing frequent-itemset-based methods and other classical state-of-the-art document clustering methods, and that the top K words FIC identifies to characterize each document topic are more meaningful than those produced by the classical topic model LDA (latent Dirichlet allocation).
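The three stages described in the abstract (itemset-based representation, similarity network, community division) can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: a brute-force counter stands in for FP-Growth, itemsets are capped at two words, Jaccard similarity compares documents, and connected components of a thresholded graph stand in for a real community detection method. All function names, the toy corpus, and the parameter values are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(docs, min_support, max_size=2):
    """Brute-force stand-in for FP-Growth (assumption): count every word
    set of size <= max_size and keep those in >= min_support documents."""
    counts = Counter()
    for words in docs:
        vocab = sorted(set(words))
        for k in range(1, max_size + 1):
            counts.update(combinations(vocab, k))
    return {frozenset(s) for s, c in counts.items() if c >= min_support}


def represent(docs, itemsets):
    """Represent each document by the frequent itemsets it fully contains."""
    return [{s for s in itemsets if s <= set(words)} for words in docs]


def jaccard(a, b):
    """Jaccard similarity of two sets (an assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, min_support=2, threshold=0.3):
    """Build a document-document network and label its components."""
    reps = represent(docs, frequent_itemsets(docs, min_support))
    n = len(docs)
    # Link two documents when their representations are similar enough.
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if jaccard(reps[i], reps[j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    # Connected components: a crude stand-in for community detection.
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels


# Toy corpus (hypothetical): two finance documents, two sports documents.
docs = [
    ["stock", "market", "price"],
    ["stock", "market", "trade"],
    ["football", "match", "goal"],
    ["football", "match", "team"],
]
labels = cluster(docs)
print(labels)  # [0, 0, 1, 1]
```

In this toy run, words that appear in only one document never become frequent itemsets, so each document is described by a handful of shared co-occurrence patterns rather than by the whole vocabulary, which is the dimensionality reduction the abstract refers to.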

Key words: document clustering, frequent itemsets, complex network, community division, text representation model
