User Behavior Analysis and Category Study of Rare Queries
Empirical Study on Rare Query Categorization
Abstract: Rare queries are queries that users submit to search engines very infrequently. They account for a large fraction of distinct queries and strongly affect user experience, yet because of data sparseness they have received little attention in existing studies of search engine user behavior. Building on prior work, this paper uses large-scale search logs from a commercial search engine to analyze user behavior on rare queries and to study rare query categories at the session level. We characterize user behavior with 12 features covering three aspects: behavior on the goal query, behavior on subsequent related queries, and behavior over the entire session. On this basis we propose, for the first time, a semi-supervised categorization framework for rare queries, and we use a modified AdaBoost algorithm to classify rare query sessions. Evaluated on 2,000 randomly sampled rare query sessions, the classification achieves an average AUC of over 83%. This work provides a foundation for Web search research on rare queries, including user behavior analysis.
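As a rough illustration of the classification and evaluation steps named in the abstract, the following sketch trains a textbook AdaBoost with decision stumps on session feature vectors and scores it with AUC. It is not the modified AdaBoost used in the paper, whose changes are not described here. The three feature columns (clicks on the goal query, number of follow-up queries, session length) and the toy sessions are hypothetical and made up for illustration only.

```python
import math


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stump_predict(row, feature, threshold, polarity):
    """Decision stump: predict `polarity` if the feature exceeds the threshold."""
    return polarity if row[feature] > threshold else -polarity


def fit_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                err = sum(
                    wi
                    for row, yi, wi in zip(X, y, w)
                    if stump_predict(row, feature, threshold, polarity) != yi
                )
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best


def adaboost_fit(X, y, rounds=5):
    """Standard (unmodified) AdaBoost; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, feature, threshold, polarity = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, feature, threshold, polarity))
        # Raise the weight of misclassified sessions, lower the rest.
        w = [
            wi * math.exp(-alpha * yi * stump_predict(row, feature, threshold, polarity))
            for row, yi, wi in zip(X, y, w)
        ]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def adaboost_score(model, row):
    """Real-valued score: weighted vote of all stumps."""
    return sum(a * stump_predict(row, f, t, p) for a, f, t, p in model)


# Hypothetical toy sessions: [goal-query clicks, follow-up queries, session length]
X = [[0, 3, 5], [0, 2, 4], [1, 1, 2], [2, 0, 1], [0, 4, 6], [3, 1, 3]]
y = [-1, -1, 1, 1, -1, 1]  # made-up binary category labels

model = adaboost_fit(X, y)
train_scores = [adaboost_score(model, row) for row in X]
train_auc = auc(y, train_scores)
```

On this toy set the classes are separable by the first feature, so the training AUC says nothing about real performance. A realistic evaluation would compute AUC on held-out sessions, once per category when there are more than two categories, and average the results.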