ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2015, Vol. 52, Issue (9): 1941-1953. doi: 10.7544/issn1000-1239.2015.20140533

Mass of Short Texts Clustering and Topic Extraction Based on Frequent Itemsets

Peng Min1,2, Huang Jiajia1, Zhu Jiahui3, Huang Jimin1, Liu Jiping1   

  1 (Computer School, Wuhan University, Wuhan 430072); 2 (Shenzhen Research, Wuhan University, Shenzhen, Guangdong 518057); 3 (State Key Laboratory of Software Engineering (Wuhan University), Wuhan 430072)
  • Online: 2015-09-01

Abstract: Short texts generated in social media are characterized by high volume, high velocity, low quality, and variety, so vector-space-based clustering methods face the challenges of high dimensionality, feature sparsity, and noise. In this paper, we propose a short text clustering and topic extraction (STC-TE) framework based on frequent itemsets mined from the texts. The framework first studies the impact of multiple features on short text quality. Then, a large number of frequent itemsets are mined from the high-quality short text set by setting a low support level, and a similar-itemset filtering strategy is devised to discard most of the unimportant frequent itemsets. Next, based on the similarity between frequent itemsets as evaluated by their relevant texts, we propose a cluster self-adaptive spectral clustering (CSA_SC) algorithm that groups the itemsets into topic clusters. Finally, the large-scale collection of short texts is classified into the associated clusters according to the topic words extracted from the frequent itemset clusters. The framework is tested on a SinaWeibo dataset of one million short texts to evaluate the selection and clustering of important frequent itemsets, the extraction of topic words, and the classification of large-scale short texts. Experimental results show that the STC-TE framework achieves topic extraction and large-scale short text clustering with high accuracy.
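
The pipeline outlined in the abstract can be illustrated with a toy sketch in Python. It is not the authors' implementation: the function names, thresholds, and sample texts are all hypothetical, the quality-assessment step is omitted, and a simple connected-components grouping stands in for the CSA_SC spectral clustering algorithm, whose details the abstract does not give. Itemset similarity is taken here to be the Jaccard overlap of the sets of texts that contain each itemset, which is one plausible reading of "similarity evaluated by relevant texts".

```python
from itertools import combinations

def mine_frequent_itemsets(docs, min_support, max_size=2):
    """Map each itemset (frozenset of words) to the indices of texts containing it."""
    support = {}
    for idx, doc in enumerate(docs):
        words = sorted(set(doc))
        for size in range(1, max_size + 1):
            for combo in combinations(words, size):
                support.setdefault(frozenset(combo), set()).add(idx)
    return {s: ids for s, ids in support.items() if len(ids) >= min_support}

def jaccard(a, b):
    """Jaccard overlap of two sets of text indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def filter_similar(itemsets, threshold):
    """Visit larger itemsets first; drop any whose relevant texts nearly match a kept one."""
    kept = {}
    order = sorted(itemsets, key=lambda s: (-len(s), -len(itemsets[s]), sorted(s)))
    for s in order:
        if all(jaccard(itemsets[s], ids) < threshold for ids in kept.values()):
            kept[s] = itemsets[s]
    return kept

def group_itemsets(itemsets, link_threshold):
    """Assumption: connected components replace spectral clustering in this sketch."""
    keys = list(itemsets)
    parent = {k: k for k in keys}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for a, b in combinations(keys, 2):
        if jaccard(itemsets[a], itemsets[b]) >= link_threshold:
            parent[find(a)] = find(b)
    clusters = {}
    for k in keys:
        clusters.setdefault(find(k), set()).update(k)
    return list(clusters.values())  # each cluster is a set of topic words

def assign(doc, topics):
    """Index of the topic sharing the most words with the text, or None if no overlap."""
    scores = [len(set(doc) & t) for t in topics]
    best = max(scores, default=0)
    return scores.index(best) if best > 0 else None

# Hypothetical, already tokenized short texts.
docs = [
    ["rain", "storm", "flood"],
    ["rain", "storm", "wind"],
    ["storm", "rain", "umbrella"],
    ["match", "goal", "team"],
    ["goal", "team", "coach"],
    ["team", "goal", "fans"],
]
itemsets = mine_frequent_itemsets(docs, min_support=3)
kept = filter_similar(itemsets, threshold=0.9)
topics = group_itemsets(kept, link_threshold=0.5)
labels = [assign(d, topics) for d in docs]
```

On this toy input the filter keeps only the two word pairs, because each single word is supported by exactly the same texts as its pair, and every text is then assigned to the topic whose words it shares. A real corpus would need a proper frequent itemset miner and a true spectral clustering step in place of these stand-ins.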

Key words: large-scale, short texts, frequent itemsets, clustering, topic extraction

CLC Number: