高级检索

    一种层次化的检索结果聚类方法

    A Hierarchical Search Result Clustering Method

    • 摘要: 检索结果聚类能够帮助用户快速地浏览搜索引擎返回的结果.传统的聚类方法由于不能生成有意义的类别标签因此是不适合的,为了改善检索结果层次化聚类的效果,采用了基于标签的聚类算法,提出了将DF、查询日志、查询词上下文特征融合的类别标签抽取算法,并以抽取的标签构造基础类别图,通过GBCA算法构建层次化聚类结果.实验证明了多特征融合模型的有效性;GBCA算法在类别标签抽取和F-Measure两个评价指标上都比STC和Snaket算法有很大的提高.

       

      Abstract: Search result clustering can help users quickly browse through the documents returned by search engine. Traditional clustering techniques are inadequate since they can not generate clusters with highly readable names. In order to improve the performance of the search result clustering and help user to quickly locate the relevant document, a label-based clustering method is used to make the search result clustering. A multi-feature integrated model is developed to extract base-cluster labels, which combines the DF, query log and query context features together. Using the extracted labels, some basic clusters are built. In order to setup a hierarchical clustering structure, a basic cluster relation graph is built based on these basic clusters. A hierarchical cluster structure is generated from the basic cluster relation graph using the graph based cluster algorithm (GBCA). To evaluate the search result clustering method, a test-bed is set up. P@N and F-Measure are introduced to evaluate the extracted labels and the document distribution in clusters. The experiment shows that the integrated label-extraction model is very effective. The more feature is used, the higher P@N can be gained. Compared with the STC and Snaket clustering method, GBCA outperforms the STC and Snaket in cluster label extraction and F-Measure.

       

    /

    返回文章
    返回