Exploration of Weighted Proximity Measure in Information Retrieval

Xue Yuanhai; Yu Xiaoming; Liu Yue; Guan Feng; Cheng Xueqi

doi:10.7544/issn1000-1239.2014.20130339


Abstract

A key problem in information retrieval is providing information seekers with relevant, accurate, and even complete information. Many traditional retrieval models are built on the bag-of-words assumption and ignore the relationships among query terms. Term proximity has often been used to improve the effectiveness of classical retrieval models, but most of that work does not account for differences in the importance of individual query terms. In modern information retrieval, query terms are not fully independent of one another, and they also differ in importance. Distinguishing the importance of query terms when computing proximity should therefore help improve retrieval effectiveness. To this end, a weighted term proximity measure is introduced that distinguishes the importance of query terms using background information from the collection being searched. The weighted proximity BM25 model (WP-BM25), which integrates this measure into the Okapi BM25 model, is proposed for ranking retrieved documents. A series of comparative experiments on three standard TREC collections, FR88-89, WT2G, and WT10G, shows that the model is robust and significantly improves retrieval performance.
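The abstract does not give the WP-BM25 formulas, so the following is only a minimal sketch of the general idea: a standard BM25 score plus a proximity bonus in which each pair of query terms is weighted by term importance. Several choices here are assumptions made for illustration and are not taken from the paper: IDF stands in for the collection "background information", proximity is the minimum positional distance between each pair of query terms with an inverse-square decay, and the names `weighted_proximity`, `wp_bm25_sketch`, and the mixing parameter `lam` are hypothetical.

```python
import math
from collections import Counter


def idf(term, docs):
    """Smoothed IDF over tokenized docs; the +1 keeps the value non-negative."""
    n = sum(1 for d in docs if term in d)
    return math.log((len(docs) - n + 0.5) / (n + 0.5) + 1.0)


def bm25(query, doc, docs, k1=1.2, b=0.75):
    """Plain Okapi BM25 score of one tokenized document for a tokenized query."""
    avgdl = sum(len(d) for d in docs) / len(docs)
    tf = Counter(doc)
    score = 0.0
    for t in query:
        f = tf[t]
        if f == 0:
            continue
        norm = f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf(t, docs) * norm
    return score


def weighted_proximity(query, doc, docs):
    """Assumed proximity bonus: for every pair of distinct query terms that
    both occur in doc, add idf(a) * idf(c) / (minimum distance)^2."""
    positions = {t: [i for i, w in enumerate(doc) if w == t] for t in set(query)}
    present = sorted(t for t in positions if positions[t])
    total = 0.0
    for i, a in enumerate(present):
        for c in present[i + 1:]:
            # Distinct terms never share a position, so dist >= 1.
            dist = min(abs(p - q) for p in positions[a] for q in positions[c])
            total += idf(a, docs) * idf(c, docs) / dist ** 2
    return total


def wp_bm25_sketch(query, doc, docs, lam=0.5):
    """Hypothetical combination: BM25 plus lam times the weighted proximity."""
    return bm25(query, doc, docs) + lam * weighted_proximity(query, doc, docs)


docs = [
    ["weighted", "proximity", "measure"],
    ["weighted", "a", "b", "c", "proximity"],
    ["x", "y", "z"],
]
query = ["weighted", "proximity"]
scores = [wp_bm25_sketch(query, d, docs) for d in docs]
```

In this toy collection the first document, where the two query terms are adjacent, receives a larger proximity bonus than the second, where they are four positions apart. Because the pair weight is a product of IDF values, rare query terms appearing close together contribute more than common ones, which is one plausible reading of "distinguishing the importance of query terms" but not necessarily the weighting used in the paper.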
