Abstract:
As a new area of natural language processing, topic tracking has received a lot of attentions from experts both at home and at broad, and has become more and more popular. Topic tracking is defined to be the task of monitoring a stream of news stories to find those that discuss the topic known to the system. Research is made into three key problems in the query-based topic tracking: feature extraction, feature weight computation, and similarity measure. Firstly, a feature extraction algorithm based on the combination of word differentiation and the location property is proposed. The basic idea of word differentiation is to divide words into capital words, whose initials are capital, and common words, whose initials are not capital. The location property decides the importance of words based on the inverse-pyramidal structure of the news stories. Secondly, a new method to compute the feature's weight based on the combination of several different feature extraction algorithms is proposed. This method gives the feature bigger weight, which is selected by more feature extraction algorithms. Finally, a double filtration algorithm based on the majority vote rule is proposed, which makes two judgments about the relativity of a story and a topic, and reduces the system's false alarm successfully. Experiments indicate that these three proposed methods not only perform well, but also have good scalability.