A New Documents Clustering Method Based on Frequent Itemsets

Zhang Xuesong; Jia Caiyan

doi:10.7544/issn1000-1239.2018.20160662

Zhang Xuesong, Jia Caiyan. A New Documents Clustering Method Based on Frequent Itemsets[J]. Journal of Computer Research and Development, 2018, 55(1): 102-112. DOI: 10.7544/issn1000-1239.2018.20160662

Abstract

Traditional document clustering methods represent documents with a vector space model (VSM) of words. The VSM representation measures only the importance of individual words, ignores the semantic relationships between words, and has high dimensionality. In this study, we propose a new document clustering method, FIC (frequent itemsets based document clustering). The method represents each document by frequent itemsets mined from the documents with the FP-Growth algorithm, where a frequent itemset is a set of words that frequently co-occur. We then construct a document-document relationship network based on the similarity between pairs of documents under this new representation. Finally, we divide the network into communities with a given community detection method to complete the clustering. FIC thereby overcomes the high dimensionality of VSM and also makes full use of the topological relationships among documents. Experimental results on two English corpora (Reuters-21578 and 20Newsgroup) and one Chinese corpus (Sougou-News) show that FIC outperforms the other existing frequent itemset based methods and other classical state-of-the-art document clustering methods, and that the top K words FIC identifies to characterize each document topic are more meaningful than those produced by the classical topic model LDA (latent Dirichlet allocation).
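The three steps described in the abstract (itemset mining, network construction, community detection) can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: brute-force enumeration of small word sets stands in for FP-Growth, Jaccard overlap stands in for the document similarity (the abstract does not specify one), and connected components of a thresholded graph stand in for the community detection method. All function names, parameters, and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_frequent_itemsets(docs, min_support=2, max_size=2):
    """Brute-force stand-in for FP-Growth (assumption): count every word set
    of size 1..max_size and keep those found in >= min_support documents."""
    counts = Counter()
    for words in docs:
        unique = sorted(set(words))
        for k in range(1, max_size + 1):
            counts.update(combinations(unique, k))
    return {frozenset(s) for s, c in counts.items() if c >= min_support}


def represent(docs, itemsets):
    """Represent each document by the frequent itemsets it fully contains."""
    return [{s for s in itemsets if s <= set(words)} for words in docs]


def jaccard(a, b):
    """Jaccard overlap of two itemset collections (assumed similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_network(reps, threshold=0.3):
    """Link two documents when their similarity reaches the threshold."""
    adj = {i: set() for i in range(len(reps))}
    for i, j in combinations(range(len(reps)), 2):
        if jaccard(reps[i], reps[j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def connected_components(adj):
    """Simplified stand-in for community detection (assumption)."""
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters


def cluster_documents(docs, min_support=2, max_size=2, threshold=0.3):
    itemsets = mine_frequent_itemsets(docs, min_support, max_size)
    reps = represent(docs, itemsets)
    return connected_components(build_network(reps, threshold))


# Toy tokenized documents, invented for illustration only.
docs = [
    ["stock", "market", "price"],
    ["stock", "market", "trade"],
    ["football", "match", "goal"],
    ["football", "goal", "team"],
]
clusters = cluster_documents(docs)
```

On real corpora, a dedicated FP-Growth implementation and a modularity-based community detection algorithm would replace the two stand-ins; the brute-force miner above grows combinatorially with document length and is only practical for small examples.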
