

    Tunneling Techniques for Focused Web Crawling

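    • Abstract (translated from Chinese): Because the Web environment is complex and Web pages often cover multiple topics, retrieving more pages relevant to a specific topic requires traversing topic-irrelevant pages to reach further relevant ones; this is called tunneling. Tunneling is divided into grey tunneling and black tunneling. For grey tunneling, a multi-topic Web page is split during crawling into a small number of content blocks that are processed separately, so that a block is not penalized because the page as a whole is off-topic. For black tunneling, each topic-irrelevant page in the tunnel is assigned a depth value according to the topical relevance of its parent page, and the page is kept or discarded according to that depth value, which extends the region covered by the focused crawl. Experimental results show that both methods achieve the expected effect, so the methods are effective, robust, and practical.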


      Abstract: Because the Web environment is complex and Web pages often cover multiple topics, it is difficult to retrieve all the Web pages relevant to a specific topic. An irrelevant Web page may link to a relevant one, so the crawler must traverse irrelevant pages to reach more relevant pages. This procedure is called tunneling. This paper presents research on tunneling techniques, together with a correction to previous results. Tunneling is partitioned into grey tunneling and black tunneling. In grey tunneling, a multi-topic page is divided during crawling into several blocks that are processed individually, so that a block relevant to the topic is not discarded merely because the page as a whole is irrelevant. In black tunneling, each irrelevant page is assigned a depth value according to the relevance of its parent page, and the depth value determines whether the page is kept; this broadens the scope of the topical crawler. The experimental results show that both tunneling methods achieve the expected effect. Accordingly, the approaches are effective, robust, and practical.
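
      The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the word-fraction relevance measure, the threshold `RELEVANCE_THRESHOLD`, the cutoff `MAX_TUNNEL_DEPTH`, and all function names are assumptions chosen only for illustration, since the abstract does not specify them.

```python
from typing import List, Optional, Set

RELEVANCE_THRESHOLD = 0.5  # assumed value; not given in the abstract
MAX_TUNNEL_DEPTH = 3       # assumed cutoff; not given in the abstract


def keyword_relevance(text: str, topic_terms: Set[str]) -> float:
    """Toy relevance (an assumption): fraction of words that are topic terms."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in topic_terms)
    return hits / len(words)


def grey_tunnel_score(blocks: List[str], topic_terms: Set[str]) -> float:
    """Grey tunneling: score each content block separately and keep the best,
    so off-topic blocks on the same page do not dilute a relevant block."""
    return max((keyword_relevance(b, topic_terms) for b in blocks), default=0.0)


def black_tunnel_depth(page_relevance: float, parent_depth: int) -> Optional[int]:
    """Black tunneling: derive a depth value from the parent page.

    A relevant page resets the depth to 0. An irrelevant page gets its
    parent's depth plus one (a relevant parent has depth 0). Returns None
    when the page lies too deep in the tunnel and should be discarded.
    """
    if page_relevance >= RELEVANCE_THRESHOLD:
        return 0
    depth = parent_depth + 1
    return depth if depth <= MAX_TUNNEL_DEPTH else None


if __name__ == "__main__":
    terms = {"crawler", "tunneling"}
    blocks = ["sports news today", "focused crawler tunneling"]
    whole_page = " ".join(blocks)
    print("whole page:", keyword_relevance(whole_page, terms))
    print("best block:", grey_tunnel_score(blocks, terms))
    print("depth under relevant parent:", black_tunnel_depth(0.1, 0))
    print("depth past cutoff:", black_tunnel_depth(0.1, MAX_TUNNEL_DEPTH))
```

      In this sketch the whole page scores below the assumed threshold while its best block scores above it, which is the situation grey tunneling is meant to handle. An irrelevant page directly under a relevant parent is kept with depth 1, and a page whose depth would exceed the assumed cutoff is dropped.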


