ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2018, Vol. 55, Issue (8): 1631-1640. doi: 10.7544/issn1000-1239.2018.20180233

Special Issue: 2018 Special Topic on Frontier Advances in Data Mining

A Distributed Representation Model for Short Text Analysis

Liang Jiye, Qiao Jie, Cao Fuyuan, Liu Xiaolin

  1. (School of Computer and Information Technology, Shanxi University, Taiyuan 030006) (Key Laboratory of Computational Intelligence and Chinese Information Processing (Shanxi University), Ministry of Education, Taiyuan 030006)
  • Online: 2018-08-01

Abstract: Learning distributed representations of short texts has become an important task in text mining. However, directly applying the traditional Paragraph Vector model to short texts may not be suitable: it does not use corpus-level information during training, so it cannot effectively compensate for the insufficient contextual information in short texts. To address this, we propose a novel distributed representation model for short texts called BTPV (biterm topic paragraph vector). BTPV adds the topic information of BTM (biterm topic model) to the Paragraph Vector model. This approach not only uses the global information of the corpus, but also refines the implicit vector of Paragraph Vector with the explicit topic information of BTM. Finally, we crawl popular news comments from the Internet as experimental data sets and use the K-Means clustering algorithm to compare the representation performance of the models. Experimental results show that BTPV achieves better clustering results than common distributed representation models such as word2vec and Paragraph Vector, which indicates the advantage of the proposed model for short text analysis.
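
For illustration only, the pipeline outlined in the abstract can be sketched in Python. The abstract does not say how BTPV injects BTM topic information into Paragraph Vector training, so the sketch below makes a loud assumption: it simply concatenates a pre-computed paragraph vector with a pre-computed document-topic distribution and clusters the result with a plain K-Means. The function names (`extract_biterms`, `combine_views`, `kmeans`), the `topic_weight` parameter, and all numeric values are hypothetical and are not taken from the paper. The only part grounded in BTM itself is the notion of a biterm as an unordered pair of words co-occurring in the same short text.

```python
from itertools import combinations

import numpy as np


def extract_biterms(tokens):
    """Return every unordered pair of tokens from one short text (a 'biterm')."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def combine_views(paragraph_vecs, topic_dists, topic_weight=1.0):
    """Hypothetical fusion step: L2-normalise each view, then concatenate.

    This is an assumption made for illustration, not the BTPV training objective.
    """
    def l2_normalise(x):
        x = np.asarray(x, dtype=float)
        norms = np.linalg.norm(x, axis=1, keepdims=True)
        return x / np.maximum(norms, 1e-12)

    return np.hstack([l2_normalise(paragraph_vecs),
                      topic_weight * l2_normalise(topic_dists)])


def assign(X, centers):
    """Index of the nearest center (squared Euclidean distance) for each row."""
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centers)
        # Keep the old center if a cluster ends up empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(X, centers), centers


if __name__ == "__main__":
    # Biterms of one toy comment.
    print(extract_biterms(["apple", "phone", "price"]))

    # Toy stand-ins for vectors that real Paragraph Vector and BTM models
    # would produce; the numbers are made up.
    paragraph_vecs = np.array([[0.9, 0.1, 0.0],
                               [0.8, 0.2, 0.1],
                               [0.1, 0.9, 0.7],
                               [0.0, 0.8, 0.9]])
    topic_dists = np.array([[0.8, 0.2],
                            [0.7, 0.3],
                            [0.2, 0.8],
                            [0.1, 0.9]])
    features = combine_views(paragraph_vecs, topic_dists)
    labels, _ = kmeans(features, k=2)
    print(labels)
```

In a real experiment, the toy arrays would be replaced by vectors from trained models, and the clustering would be scored against the known categories of the comments; which evaluation measures the authors used is not stated in the abstract.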

Key words: distributed representation, short text, document analysis, paragraph vector, biterm topic model (BTM)

CLC Number: