ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2014, Vol. 51, Issue (10): 2216-2224. doi: 10.7544/issn1000-1239.2014.20130339

• Information Processing •

Exploration of Weighted Proximity Measure in Information Retrieval

Xue Yuanhai (1,2,3), Yu Xiaoming (1,2), Liu Yue (1,2), Guan Feng (1,2,3), Cheng Xueqi (1,2)

  1. Key Laboratory of Network Data Science and Technology, Chinese Academy of Sciences, Beijing 100190
  2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190
  3. University of Chinese Academy of Sciences, Beijing 100190
  • E-mail: xueyuanhai@software.ict.ac.cn
  • Published: 2014-10-01
  • Funding:
    National Basic Research Program of China (973 Program) (2015CB358700); National Natural Science Foundation of China (60903107, 61073071)

Abstract: A central problem in information retrieval is to provide users with relevant, accurate, and ideally complete information. Many traditional retrieval models rely on the bag-of-words assumption and therefore ignore the relationships among query terms. Term proximity has often been used to improve classical retrieval models, but most of that work does not account for differences in importance among the query terms. In modern retrieval queries, the terms are neither fully independent of one another nor equally important. Distinguishing term importance when computing proximity should therefore help improve retrieval performance. To this end, a weighted term proximity measure is introduced that distinguishes the importance of query terms using background information from the collection being searched. The weighted proximity BM25 model (WP-BM25), which integrates this measure into the Okapi BM25 model, is proposed for ranking the retrieved documents. A series of comparative experiments on three standard TREC collections (FR88-89, WT2G, and WT10G) shows that the model is robust and significantly improves retrieval performance.

Key words: weighted proximity, measure method, BM25, term significance, information retrieval
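
The abstract does not state the WP-BM25 formulas, so the following is only an illustrative sketch of the general idea: a BM25 score plus a proximity bonus in which each pair of query terms is weighted by how important the two terms are. Everything specific here is an assumption made for illustration, not the authors' definition: IDF stands in for the collection-based importance signal, the inverse squared minimum distance stands in for the proximity measure, and `lam`, `weighted_proximity`, and `wp_bm25_sketch` are hypothetical names.

```python
import math
from collections import Counter


def idf(term, docs):
    """Smoothed IDF, used here as an assumed stand-in for term importance."""
    n = sum(1 for d in docs if term in d)
    return math.log((len(docs) - n + 0.5) / (n + 0.5) + 1.0)


def bm25(query, doc, docs, k1=1.2, b=0.75):
    """Plain Okapi BM25 over tokenized documents (lists of strings)."""
    avgdl = sum(len(d) for d in docs) / len(docs)
    tf = Counter(doc)
    score = 0.0
    for t in set(query):
        f = tf[t]
        if f == 0:
            continue
        norm = f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf(t, docs) * norm
    return score


def weighted_proximity(query, doc, docs):
    """Assumed proximity bonus: for each pair of distinct query terms found
    in the document, add idf(t) * idf(u) / (minimum token distance)^2."""
    positions = {t: [i for i, w in enumerate(doc) if w == t] for t in set(query)}
    terms = sorted(t for t in positions if positions[t])
    total = 0.0
    for i, t in enumerate(terms):
        for u in terms[i + 1:]:
            dist = min(abs(p - q) for p in positions[t] for q in positions[u])
            total += idf(t, docs) * idf(u, docs) / (dist ** 2)
    return total


def wp_bm25_sketch(query, doc, docs, lam=0.5):
    """BM25 plus the weighted proximity bonus; lam is a hypothetical mixing weight."""
    return bm25(query, doc, docs) + lam * weighted_proximity(query, doc, docs)


# Toy collection: doc_a and doc_b contain the same query terms equally often
# and have the same length, but only doc_a has the two terms adjacent.
doc_a = ["weighted", "proximity", "helps", "ranking", "of", "documents"]
doc_b = ["weighted", "scores", "help", "ranking", "documents", "proximity"]
doc_c = ["unrelated", "text", "about", "something", "else", "entirely"]
docs = [doc_a, doc_b, doc_c]
query = ["weighted", "proximity"]

score_a = wp_bm25_sketch(query, doc_a, docs)
score_b = wp_bm25_sketch(query, doc_b, docs)
score_c = wp_bm25_sketch(query, doc_c, docs)
```

In this toy collection plain BM25 cannot separate `doc_a` from `doc_b`, so any difference between `score_a` and `score_b` comes from the proximity bonus alone. Because the bonus is scaled by the importance of both terms, a close pair of rare terms contributes more than a close pair of common ones, which is the intuition the abstract describes.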
