Abstract:
Entity search is a promising research topic because it provides users with detailed information from the Web. A key problem in entity search is collecting Web pages quickly and completely for the relevant entities in a specific domain. To address this problem, a website is modeled as a graph over a set of connected important states. A novel algorithm named LPC is then proposed to learn the optimal link sequences leading to the goal pages in which entities are embedded. The LPC algorithm uses a two-stage strategy. In the first stage, it uses a conditional random field (CRF), an undirected graphical model, to capture sequential link patterns leading to goal pages. The conditional exponential models of the CRF can exploit a variety of features, including state and transition features extracted around hyperlinks and from HTML pages. In the second stage, the links in the crawling frontier queue are prioritized using reinforcement learning and the trained CRF model. A discounted reward approach from reinforcement learning is employed to compute the reward score using the CRF model learned during the path classification phase. Experimental results on massive real-world data show that the predictive ability of the CRF helps LPC outperform other focused crawlers.
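
The second stage can be illustrated with a minimal sketch of frontier prioritization by discounted reward. This is not the LPC implementation: the state names, the per-state hop counts, the discount factor, and the `discounted_score` and `Frontier` helpers are all hypothetical, and the state probabilities stand in for the output of a trained CRF. The sketch assumes that a link's score is the expected reward of reaching a goal page, discounted by the estimated number of hops remaining.

```python
import heapq
from typing import Dict, List, Tuple

# Assumed discount factor; the abstract does not state a value.
GAMMA = 0.5

# Hypothetical states with an assumed number of hops to a goal page.
HOPS_TO_GOAL: Dict[str, int] = {"goal": 0, "list": 1, "other": 3}


def discounted_score(state_probs: Dict[str, float],
                     hops_to_goal: Dict[str, int] = HOPS_TO_GOAL,
                     gamma: float = GAMMA) -> float:
    """Expected discounted reward of a link.

    state_probs maps each state to the probability that the link's
    target page is in that state (a stand-in for CRF marginals).
    """
    return sum(p * gamma ** hops_to_goal[s] for s, p in state_probs.items())


class Frontier:
    """Crawling frontier that pops the highest-scoring link first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        self._counter = 0  # tie-breaker that keeps insertion order

    def push(self, url: str, score: float) -> None:
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def build_demo_frontier() -> Frontier:
    """Fill a frontier with three made-up links and their state probabilities."""
    frontier = Frontier()
    links = {
        "http://example.com/about": {"other": 1.0},
        "http://example.com/products": {"list": 0.8, "goal": 0.2},
        "http://example.com/item/1": {"goal": 0.9, "other": 0.1},
    }
    for url, probs in links.items():
        frontier.push(url, discounted_score(probs))
    return frontier
```

In this sketch, a link that most likely leads directly to a goal page is crawled before one that probably needs several more hops, which is the intended effect of discounting.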