Focused Web Entity Search Using the Linked-Path Prediction Model

Huang Jianbin (黄健斌); Sun Heli (孙鹤立)

Abstract

Entity search is a promising research area because it can give users more detailed Web information. A key problem in entity search is collecting, quickly and completely, the Web pages that contain entities of a specific domain. To address this problem, a website is modeled as a graph over a set of interconnected states, and a link-path prediction learning algorithm, LPC, is proposed. LPC learns the optimal link paths leading from a site's home page to the goal pages in which entities are embedded, and so guides the crawler quickly to those pages. The algorithm has two stages. In the first stage, a conditional random field (CRF), a probabilistic undirected graphical model, learns the link-path model from the home page to the goal pages; the CRF can combine a variety of features of hyperlinks and HTML pages, including state features and transition features. In the second stage, the hyperlinks in the crawl frontier queue are assigned priority scores by combining reinforcement learning with the trained CRF model: a discounted-reward method from reinforcement learning computes each link's reward value using the CRF model learned in the path classification stage. Experimental results on a large volume of real data from multiple domains show that the CRF-guided LPC crawler clearly outperforms other focused crawlers.
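The second stage described above, ranking frontier links by a discounted reward, can be illustrated with a minimal Python sketch. The abstract does not give the scoring formula, so everything here is an assumption for illustration: the reward is taken to be a sum of per-hop goal probabilities weighted by powers of a discount factor `gamma`, the per-hop probabilities are hard-coded where LPC would use the trained CRF, and the names `discounted_reward` and `Frontier` are hypothetical.

```python
import heapq
from typing import List, Tuple


def discounted_reward(step_probs: List[float], gamma: float = 0.9) -> float:
    """Assumed scoring rule: sum of gamma**k * p_k.

    step_probs[k] stands in for the model's estimate that a goal page is
    reached k hops after following the link (k = 0 is the linked page).
    In LPC this estimate would come from the trained CRF; here it is
    supplied by the caller.
    """
    return sum((gamma ** k) * p for k, p in enumerate(step_probs))


class Frontier:
    """Crawl frontier that always returns the highest-scoring link first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        self._counter = 0  # tie-breaker: earlier insertions win on equal score

    def push(self, url: str, score: float) -> None:
        # heapq is a min-heap, so the score is negated to pop the maximum.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self) -> Tuple[str, float]:
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self) -> int:
        return len(self._heap)


# Hypothetical links and made-up probability estimates.
estimates = {
    "/people/faculty": [0.1, 0.8],      # likely one hop away from goal pages
    "/news": [0.0, 0.05],               # unlikely to lead to goal pages
    "/people/faculty/alice": [0.9],     # likely a goal page itself
}

frontier = Frontier()
for url, probs in estimates.items():
    frontier.push(url, discounted_reward(probs))

while frontier:
    url, score = frontier.pop()
    print(f"{score:.3f}  {url}")
```

With these made-up estimates the crawler would visit `/people/faculty/alice` first, then `/people/faculty`, then `/news`. A smaller `gamma` favors links that are themselves goal pages over links that merely lead toward them.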
