Abstract:
Search result clustering helps users quickly browse the documents returned by a search engine. Traditional clustering techniques are inadequate because they cannot generate clusters with highly readable names. To improve the performance of search result clustering and help users quickly locate relevant documents, a label-based clustering method is used. A multi-feature integrated model is developed to extract base-cluster labels; it combines document frequency (DF), query log, and query context features. Basic clusters are then built from the extracted labels. To set up a hierarchical clustering structure, a basic cluster relation graph is constructed from these basic clusters, and a hierarchical cluster structure is generated from this graph using the graph-based cluster algorithm (GBCA). To evaluate the method, a test bed is set up, with P@N used to evaluate the extracted labels and F-measure used to evaluate the distribution of documents across clusters. The experiments show that the integrated label-extraction model is very effective: the more features are used, the higher the P@N obtained. GBCA also outperforms the STC and Snaket clustering methods in both cluster label extraction and F-measure.
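
As an illustration of the two evaluation measures named above, the following is a minimal sketch using their standard textbook definitions; the exact variants used in the experiments are not specified here, so this is an assumption. The function names `precision_at_n` and `f_measure`, and the choice to score one cluster against one reference document set, are hypothetical and chosen only for illustration.

```python
def precision_at_n(ranked_labels, relevant_labels, n):
    """P@N: fraction of the top-n ranked labels that are judged relevant.

    Assumes the standard definition (hits in the top n, divided by n).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    top = ranked_labels[:n]
    hits = sum(1 for label in top if label in relevant_labels)
    return hits / n


def f_measure(cluster_docs, reference_docs):
    """F-measure of one cluster against one reference document set.

    Assumes the balanced harmonic mean of precision and recall (F1).
    """
    cluster, reference = set(cluster_docs), set(reference_docs)
    overlap = len(cluster & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)   # share of the cluster that is correct
    recall = overlap / len(reference)    # share of the reference that is found
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, P@N scores the ranked list of extracted labels, while the F-measure scores how well the documents placed in a cluster match a reference grouping.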