Abstract:
Web forums are one of the major types of social media in Web 2.0. However, user-generated content in Web forums varies widely in quality, ranging from excellent, detailed opinions to off-topic content or swear words. This paper therefore proposes a novel approach based on LDA (latent Dirichlet allocation) to detect low-quality posts in Web forums. Unlike previous methods, the new approach uses both semantic and statistical features of a post to evaluate its quality. The semantic features comprise JunkInsignificant (JI) topic proportion, topic uncertainty, and topic relevance, which are computed in the LDA topic space to overcome the ineffectiveness of TF·IDF-based features on short texts. An LDA model is first built to predict the topic distribution of each post; the semantic features of a post are then computed from its topic distribution. The statistical features comprise surface, syntactic, and forum-specific features of posts, selected by analyzing post content. Since detecting low-quality posts can be treated as a binary classification problem, an SVM is used to filter them out. Experimental results on three different datasets show that the new approach outperforms previous ones in terms of precision, recall, and F1 values.
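
As an illustration only, the three semantic features might be derived from a post's topic distribution along the following lines. The abstract does not state the formulas, so this sketch rests on assumptions: topic uncertainty is taken to be the Shannon entropy of the distribution, JI topic proportion the probability mass on a hand-picked set of junk topics, and topic relevance the cosine similarity between the post's distribution and that of its thread. All function and parameter names (`topic_uncertainty`, `ji_proportion`, `topic_relevance`, `ji_topics`, `theta_thread`) are hypothetical and are not taken from the paper.

```python
import math


def topic_uncertainty(theta):
    """Assumed definition: Shannon entropy (in nats) of a topic distribution.

    A post spread evenly over many topics scores high; a focused post scores low.
    """
    return -sum(p * math.log(p) for p in theta if p > 0)


def ji_proportion(theta, ji_topics):
    """Assumed definition: probability mass on topics flagged as junk/insignificant.

    `ji_topics` is a hypothetical set of topic indices chosen by inspection.
    """
    return sum(theta[k] for k in ji_topics)


def topic_relevance(theta_post, theta_thread):
    """Assumed definition: cosine similarity between post and thread distributions."""
    dot = sum(a * b for a, b in zip(theta_post, theta_thread))
    norm = math.sqrt(sum(a * a for a in theta_post)) * math.sqrt(
        sum(b * b for b in theta_thread)
    )
    return dot / norm if norm else 0.0


def semantic_features(theta_post, theta_thread, ji_topics):
    """Collect the three semantic features into one vector for a classifier."""
    return [
        ji_proportion(theta_post, ji_topics),
        topic_uncertainty(theta_post),
        topic_relevance(theta_post, theta_thread),
    ]
```

In a full pipeline, a vector like this would be concatenated with the statistical features of the post and passed to a binary classifier such as an SVM; the topic distributions themselves would come from a trained LDA model, which is outside the scope of this sketch.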