ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development (计算机研究与发展) ›› 2015, Vol. 52 ›› Issue (2): 512-521. doi: 10.7544/issn1000-1239.2015.20131336

• Information Processing •

面向大规模微博消息流的突发话题检测

申国伟,杨武,王巍,于淼   

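  1. (Research Center of Information Security, Harbin Engineering University, Harbin 150001) (shenguowei@hrbeu.edu.cn)
  • Published: 2015-02-01
  • Funding: 
    Supported by the National High Technology Research and Development Program of China (863 Program) (2012AA012802) and the National Natural Science Foundation of China (61170242)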

Burst Topic Detection Oriented Large-Scale Microblogs Streams

Shen Guowei, Yang Wu, Wang Wei, Yu Miao   

  1. (Research Center of Information Security, Harbin Engineering University, Harbin 150001)
  • Online: 2015-02-01

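Abstract (Chinese version): Emergent events spread rapidly through microblogs and have enormous impact, so bursts of public opinion attract wide attention from governments and enterprises. Existing burst topic detection algorithms consider only a single type of feature entity and cannot handle bursts triggered by new words, pictures, links, and similar content in microblogs. For large-scale microblog message streams, we propose a real-time burst topic detection framework that requires no Chinese word segmentation. The model dynamically adjusts the window size according to the message stream and measures the burst weight of each entity by its spread influence. A high-order co-clustering algorithm clusters entities, messages, and users simultaneously, so that detecting a burst topic also yields its associated messages and participating users. Comparative experiments show that the algorithm is highly accurate and detects burst topics earlier.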

Keywords (Chinese version): burst topic detection, microblog, co-clustering, influence, large scale

Abstract: In microblogs, emergent events spread quickly and have tremendous influence, so bursts of public opinion attract wide attention from governments and enterprises. Existing burst topic detection methods consider only one type of entity, such as words or tags. However, Chinese microblogs contain not only new or colloquial words but also pictures and links, whose burst patterns are difficult to detect. To tackle this problem, we propose a real-time burst topic detection framework for multiple types of entities. Unlike existing methods, our method does not require Chinese word segmentation; instead, it generates new words as a final step. In this framework, the window size is adjusted dynamically based on the microblog stream. To measure the burst weight of an entity, its spread influence is calculated. Moreover, a high-order co-clustering algorithm based on non-negative matrix decomposition is used to cluster entities, messages, and users simultaneously. While detecting a burst topic, we also obtain the related messages and participating users, which can be used to analyze the cause of the burst. Experiments on a large Sina Weibo dataset show that our algorithm achieves higher accuracy and detects burst topics earlier than existing algorithms.

Key words: burst topic detection, microblogs, co-clustering, influence, large scale

CLC number:
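
The abstract states that entities, messages, and users are clustered simultaneously by a co-clustering algorithm based on non-negative matrix decomposition, but gives no formulation. The following is only an illustrative sketch of that general idea, not the authors' algorithm: it assumes an entity-by-message matrix `x_em` and a message-by-user matrix `x_mu` (both hypothetical toy data), factorizes them as `x_em ≈ E Mᵀ` and `x_mu ≈ M Uᵀ` with a shared message factor `M`, and uses standard multiplicative updates for the summed squared error. All function names, the update rule, and the toy data are assumptions made for illustration.

```python
import random

def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    """Return the transpose of a dense matrix."""
    return [list(col) for col in zip(*a)]

def add(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mult_update(factor, numer, denom, eps=1e-9):
    """Element-wise multiplicative update: factor * numer / (denom + eps)."""
    return [[f * n / (d + eps) for f, n, d in zip(rf, rn, rd)]
            for rf, rn, rd in zip(factor, numer, denom)]

def joint_nmf(x_em, x_mu, k, iters=300, seed=0):
    """Approximate x_em ~ E M^T and x_mu ~ M U^T with a shared factor M.

    Hypothetical formulation: minimizes the sum of the two squared
    reconstruction errors using multiplicative updates.
    """
    rng = random.Random(seed)
    n_e, n_m, n_u = len(x_em), len(x_mu), len(x_mu[0])

    def init(n):
        # Strictly positive random start so no entry is stuck at zero.
        return [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]

    E, M, U = init(n_e), init(n_m), init(n_u)
    for _ in range(iters):
        mtm = matmul(transpose(M), M)
        E = mult_update(E, matmul(x_em, M), matmul(E, mtm))
        U = mult_update(U, matmul(transpose(x_mu), M), matmul(U, mtm))
        gram = add(matmul(transpose(E), E), matmul(transpose(U), U))
        numer = add(matmul(transpose(x_em), E), matmul(x_mu, U))
        M = mult_update(M, numer, matmul(M, gram))
    return E, M, U

def labels(factor):
    """Assign each row to the latent component with the largest weight."""
    return [max(range(len(row)), key=row.__getitem__) for row in factor]

# Toy data (assumed): two topics with disjoint entities, messages, and users.
x_em = [  # rows: 4 entities, columns: 6 messages
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
x_mu = [  # rows: 6 messages, columns: 4 users who took part in each message
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

E, M, U = joint_nmf(x_em, x_mu, k=2)
entity_labels, message_labels, user_labels = labels(E), labels(M), labels(U)
print(entity_labels, message_labels, user_labels)
```

Because the message factor is shared, a cluster index refers to the same latent topic across all three label lists, which is what allows a detected topic to be read off together with its messages and participating users. The cluster numbering itself is arbitrary and depends on the random initialization.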