Abstract:
This paper studies how to crawl the Web selectively. Document feature weights and centroid feature weights are calculated using the proposed TFIDF-2 model together with three heuristic rules: Max, Ave, and Sum. Once these two weights are computed, a centroid vector corresponding to a root set can be constructed easily. The centroid vector is then used as a front-end classifier to guide a focused crawler. First, the front-end classifier and the back-end classifier each score the anchor texts of candidate URLs. Then, the two anchor-text scores of each URL are summed. Finally, the URL with the highest anchor-text score is selected from the frontier, and its corresponding Web page is downloaded. Four series of experiments are conducted. The experimental results show that, with the aid of the newly constructed centroid vector, the focused crawler can efficiently and accurately predict the relevance of a Web page using only the anchor texts of its URLs. Furthermore, the two-classifier framework gives the focused crawler an incremental crawling ability, which addresses one of the most important open problems in the domain of focused crawling.
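The URL-selection step described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the two scoring functions below are hypothetical stand-ins for the front-end (centroid-based) and back-end classifiers, which simply sum term weights from an anchor text.

```python
# Sketch of the described selection loop: two classifiers each score a URL's
# anchor text, the scores are summed, and the highest-scoring URL is crawled
# next. The scoring functions are illustrative assumptions, not the paper's
# TFIDF-2 model or its trained back-end classifier.

def front_end_score(anchor_text, centroid):
    # Stand-in for the centroid-based front-end classifier:
    # sum the centroid weights of the anchor-text terms.
    return sum(centroid.get(t, 0.0) for t in anchor_text.lower().split())

def back_end_score(anchor_text, keyword_weights):
    # Stand-in for the back-end classifier: weighted keyword match.
    return sum(keyword_weights.get(t, 0.0) for t in anchor_text.lower().split())

def select_next_url(frontier, centroid, keyword_weights):
    """Return the frontier URL whose anchor text has the highest combined score."""
    best_url, best_score = None, float("-inf")
    for url, anchor_text in frontier:
        score = (front_end_score(anchor_text, centroid)
                 + back_end_score(anchor_text, keyword_weights))
        if score > best_score:
            best_url, best_score = url, score
    return best_url

# Toy example with made-up weights and URLs.
centroid = {"machine": 0.9, "learning": 0.8}
keywords = {"learning": 0.5, "news": 0.1}
frontier = [("http://a.example/ml", "machine learning tutorial"),
            ("http://b.example/news", "daily news")]
print(select_next_url(frontier, centroid, keywords))  # → http://a.example/ml
```

The point of summing the two scores is that the front-end classifier encodes topical relevance to the root set while the back-end classifier supplies an independent judgment, so a URL must look promising to both to rank highly.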