    Zhang Xuesong, Jia Caiyan. A New Documents Clustering Method Based on Frequent Itemsets[J]. Journal of Computer Research and Development, 2018, 55(1): 102-112. DOI: 10.7544/issn1000-1239.2018.20160662

    A New Documents Clustering Method Based on Frequent Itemsets

    Traditional document clustering methods represent documents with a vector space model (VSM) over words. This representation measures only the importance of individual words, ignores the semantic relationships between words, and suffers from high dimensionality. In this study, we propose a new document clustering method, FIC (frequent itemsets based document clustering). FIC represents each document by the frequent itemsets it contains, where a frequent itemset is a set of words that frequently co-occur, mined from the documents with the FP-Growth algorithm. We then construct a document-document relationship network from the pairwise similarity of documents under this new representation. Finally, we divide the network into communities with a given community detection method to complete the clustering. FIC thereby not only overcomes the high dimensionality of VSM but also makes full use of the topological relationships among documents. Experimental results on two English corpora (Reuters-21578 and 20Newsgroup) and one Chinese corpus (Sougou-News) demonstrate that FIC is superior to other existing frequent-itemset-based methods and to classical state-of-the-art document clustering methods, and that the top K words FIC identifies to characterize each topic are more meaningful than those produced by the classical topic model LDA (latent Dirichlet allocation).
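    The three stages described in the abstract (frequent-itemset representation, document-document network, community detection) can be illustrated with a minimal sketch. It is not the authors' implementation and makes several assumptions: a brute-force itemset count stands in for FP-Growth, Jaccard similarity with a fixed threshold stands in for the unspecified similarity measure, and connected components stand in for the unspecified community detection method. All function names, parameters, and the toy corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(docs, min_support, max_size=2):
    """Brute-force stand-in for FP-Growth: keep word sets of size <= max_size
    that occur in at least min_support documents."""
    counts = Counter()
    for words in docs:
        unique = sorted(set(words))
        for k in range(1, max_size + 1):
            counts.update(combinations(unique, k))
    return [frozenset(s) for s, c in counts.items() if c >= min_support]


def represent(docs, itemsets):
    """Represent each document by the indices of the frequent itemsets it contains."""
    return [{i for i, s in enumerate(itemsets) if s <= set(words)} for words in docs]


def jaccard(a, b):
    """Assumed similarity measure; the abstract does not name one."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_network(reps, threshold):
    """Link two documents when their similarity reaches the threshold."""
    adj = {i: set() for i in range(len(reps))}
    for i, j in combinations(range(len(reps)), 2):
        if jaccard(reps[i], reps[j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def connected_components(adj):
    """Simplest possible stand-in for a community detection method."""
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        clusters.append(sorted(comp))
    return clusters


# Toy corpus (hypothetical): three finance documents, three sports documents.
docs = [
    ["stock", "market", "price"],
    ["stock", "market", "trade"],
    ["market", "stock", "bank"],
    ["football", "match", "goal"],
    ["football", "goal", "league"],
    ["match", "football", "goal"],
]
itemsets = frequent_itemsets(docs, min_support=2)
reps = represent(docs, itemsets)
clusters = connected_components(build_network(reps, threshold=0.5))
print(clusters)
```

    On this toy corpus the documents are described by nine frequent itemsets rather than by the full nine-word vocabulary plus rare terms, and the two topics fall into separate components. A real run would swap in an FP-Growth implementation and a proper community detection algorithm, and would need the support and similarity thresholds tuned per corpus.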