An Estimation of Query Language Model Based on Statistical Semantic Clustering

Pu Qiang (蒲强); He Daqing (何大庆); Yang Guowei (杨国纬)

Abstract

How to generate document clusters effectively and use cluster information to improve retrieval performance is an important research topic in information retrieval. If a document is assumed to contain a set of hidden independent topics, it can be viewed as the result of those topics interacting with one another, mixed with noise. Based on this assumption, a semantic clustering technique using independent component analysis (ICA) is proposed. It relies on the strong topic separation capability of ICA to group a set of documents into semantic clusters in semantic space according to their actual hidden topics. Within the language modeling framework, a semantic topic cluster is activated by the user's initial query according to a given measure. A feedback semantic topic model is estimated from the information in the activated semantic cluster and combined with the initial query model to form a new query model. Experimental results on five TREC data sets show that the query model estimated from statistical semantic clusters significantly improves retrieval performance over traditional query models and other cluster-based language models. The improvement comes mainly from estimating the query model with the semantic cluster that is most similar to the user's query.
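The abstract does not give the formulas used, so the following is only a minimal sketch of the last two steps it describes: activating the cluster closest to the query and mixing a feedback model with the initial query model. It assumes the semantic clusters are already available (the ICA step is not shown), uses cosine similarity as a stand-in for the unspecified activation measure, and uses a linear interpolation with a hypothetical weight `alpha`. All function names and the toy data are illustrative, not taken from the paper.

```python
from collections import Counter
import math


def unigram_model(tokens):
    """Maximum-likelihood unigram model P(w) from a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(p[w] * q.get(w, 0.0) for w in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def activate_cluster(query_model, cluster_models):
    """Index of the cluster model most similar to the query (assumed measure)."""
    return max(range(len(cluster_models)),
               key=lambda i: cosine(query_model, cluster_models[i]))


def interpolate(query_model, feedback_model, alpha=0.5):
    """New query model: (1 - alpha) * P(w|Q) + alpha * P(w|F)."""
    vocab = set(query_model) | set(feedback_model)
    return {w: (1 - alpha) * query_model.get(w, 0.0)
               + alpha * feedback_model.get(w, 0.0)
            for w in vocab}


# Hypothetical toy clusters; in the paper these would come from ICA.
clusters = [
    "jaguar car engine speed car".split(),
    "jaguar cat jungle animal cat".split(),
]
cluster_models = [unigram_model(c) for c in clusters]
query_model = unigram_model("jaguar animal".split())

best = activate_cluster(query_model, cluster_models)
new_query_model = interpolate(query_model, cluster_models[best], alpha=0.5)
```

In this toy case the ambiguous query term "jaguar" is pulled toward the animal cluster, so the new query model gains probability mass on related terms such as "cat" and "jungle" while remaining a proper distribution.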
