    Centroid-Based Focused Crawler with Incremental Ability

    • Abstract: This paper studies how to crawl selectively within a Web page. Document feature weights and centroid feature weights are computed using the proposed TFIDF-2 model and three heuristic rules: Max, Ave, and Sum. From these weights, a centroid vector corresponding to the root set of documents is constructed and used as a front-end classifier to guide a focused crawler. The front-end and back-end classifiers each score the anchor text of every URL in the frontier, and the two scores for each URL are summed. The crawler then selects the URL with the highest combined score and downloads the corresponding Web page. Four series of experiments are conducted. The results show that, guided by the centroid vector, the focused crawler can efficiently and accurately predict the relevance of a linked Web page from the anchor text alone. Furthermore, the two-classifier framework gives the focused crawler an incremental crawling ability, which the authors regard as one of the most important and interesting features that must be addressed in the domain of focused crawling.
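
      The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several pieces are assumptions: the abstract does not define TFIDF-2, so a plain smoothed TF-IDF stands in for it; Max, Ave, and Sum are read here as three ways of aggregating a term's per-document weights into one centroid weight; the front-end score is taken to be cosine similarity between the anchor text and the centroid; and the back-end classifier is a placeholder callable. All function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real crawler would need proper word segmentation.
    return text.lower().split()


def tfidf_vectors(docs):
    """Smoothed TF-IDF per document (assumed stand-in for the paper's TFIDF-2)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors


def build_centroid(doc_vectors, rule="ave"):
    """Aggregate per-document term weights into one centroid vector.

    The reading of Max/Ave/Sum used here is an assumption, not the paper's definition.
    """
    terms = {t for v in doc_vectors for t in v}
    centroid = {}
    for t in terms:
        weights = [v.get(t, 0.0) for v in doc_vectors]
        if rule == "max":
            centroid[t] = max(weights)
        elif rule == "sum":
            centroid[t] = sum(weights)
        elif rule == "ave":
            centroid[t] = sum(weights) / len(weights)
        else:
            raise ValueError(f"unknown rule: {rule}")
    return centroid


def centroid_score(anchor_text, centroid):
    """Front-end score: cosine similarity of anchor-text term counts and the centroid."""
    counts = Counter(tokenize(anchor_text))
    dot = sum(c * centroid.get(t, 0.0) for t, c in counts.items())
    norm_a = math.sqrt(sum(c * c for c in counts.values()))
    norm_c = math.sqrt(sum(w * w for w in centroid.values()))
    if norm_a == 0 or norm_c == 0:
        return 0.0
    return dot / (norm_a * norm_c)


def select_next(frontier, front_score, back_score):
    """Pick the URL whose anchor text has the highest summed front-end + back-end score.

    `frontier` maps URL -> anchor text; both scorers take the anchor text.
    """
    return max(
        frontier,
        key=lambda url: front_score(frontier[url]) + back_score(frontier[url]),
    )


# Illustrative data only; the root set and URLs are made up.
root_set = [
    "focused crawler topic relevance",
    "crawler anchor text relevance prediction",
]
centroid = build_centroid(tfidf_vectors(root_set), rule="ave")
frontier = {
    "http://example.com/a": "focused crawler tutorial",
    "http://example.com/b": "cheap flights",
}
back_end = lambda text: 0.0  # placeholder for the back-end classifier
best_url = select_next(frontier, lambda t: centroid_score(t, centroid), back_end)
```

      Keeping the two scorers as separate callables mirrors the two-classifier framing in the abstract: under this sketch, the back-end scorer could be replaced or retrained without rebuilding the centroid.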

       
