English Topic Tracking Research Based on Query Vector

Zhao Hua, Zhao Tiejun, Yu Hao, and Zheng Dequan

Abstract: Topic tracking, a relatively new area of natural language processing, has attracted growing attention from researchers both at home and abroad. It is defined as the task of monitoring a stream of news stories to find those that discuss a topic already known to the system. This paper studies three key problems in query-based topic tracking: feature extraction, feature weight computation, and similarity measurement. First, based on an analysis of the characteristics of English news stories, a feature extraction algorithm that combines word differentiation with a location property is proposed. Word differentiation divides words into capitalized words, whose initial letter is uppercase, and common words, whose initial letter is not. The location property determines the importance of a word from the inverted-pyramid structure of news stories. Second, a method for computing feature weights by fusing several feature extraction algorithms is proposed; it treats a feature as more important the more extraction algorithms select it. Finally, a double filtering algorithm based on the majority vote rule is proposed, which judges twice whether a story is relevant to a topic and greatly reduces the system's false alarm rate. Experiments indicate that the three proposed methods not only perform well but also have good scalability.
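As an illustration only, the word differentiation and fused weighting ideas in the abstract might be sketched as follows. The abstract does not give the actual formulas, so the function names `split_by_capitalization` and `fused_weights`, and the choice of "fraction of extractors that selected the feature" as the weight, are assumptions made for this sketch rather than the authors' method.

```python
from collections import Counter


def split_by_capitalization(tokens):
    """Word differentiation: separate words with an uppercase initial from the rest.

    Assumption: no special handling of sentence-initial words, which are
    capitalized for purely grammatical reasons.
    """
    capitalized = [t for t in tokens if t[:1].isupper()]
    common = [t for t in tokens if not t[:1].isupper()]
    return capitalized, common


def fused_weights(selections):
    """Weight each feature by the fraction of extractors that selected it.

    `selections` holds one collection of selected features per extraction
    algorithm. A feature chosen by more algorithms gets a larger weight;
    the exact weighting scheme here is hypothetical.
    """
    votes = Counter()
    for selected in selections:
        votes.update(set(selected))  # each extractor votes at most once per feature
    n = len(selections)
    return {feature: count / n for feature, count in votes.items()}


if __name__ == "__main__":
    caps, common = split_by_capitalization(["NATO", "troops", "entered", "Kosovo"])
    print(caps, common)
    print(fused_weights([{"Kosovo", "NATO"}, {"Kosovo", "troops"}, {"Kosovo"}]))
```

In this sketch a feature selected by all three hypothetical extractors receives weight 1.0, while one selected by a single extractor receives 1/3, which mirrors the stated idea that features chosen by more extraction algorithms matter more.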
