Centroid-Based Focused Crawler with Incremental Ability

Wang Hui; Zuo Wanli; Wang Huiyu; Ning Aijun; Sun Zhiwei; Man Chunlei

Wang Hui, Zuo Wanli, Wang Huiyu, Ning Aijun, Sun Zhiwei, Man Chunlei. Centroid-Based Focused Crawler with Incremental Ability[J]. Journal of Computer Research and Development, 2009, 46(2): 217-224.

Abstract

This paper studies how to crawl Web pages selectively. Document feature weights and centroid feature weights are calculated using the proposed TFIDF-2 model and three heuristic rules: Max, Ave, and Sum. From these two weights, a centroid vector corresponding to a root set is constructed and used as a front-end classifier to guide a focused crawler. The crawler first scores the anchor text of each URL with both the front-end classifier and a back-end classifier, then sums the two scores for each URL, and finally selects from the frontier the URL with the highest combined anchor-text score and downloads the corresponding Web page. Four series of experiments are conducted. The results show that, with the newly constructed centroid vector, the focused crawler can efficiently and accurately predict the relevance of a Web page using only the anchor text of its URL. Furthermore, the two-classifier framework gives the focused crawler an incremental crawling ability, one of the most important and interesting features in focused crawling and one that must be addressed.
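The selection loop described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: plain TF-IDF stands in for the TFIDF-2 model (which is not defined here), the Max, Ave, and Sum rules are taken to combine per-document term weights into one centroid weight, anchor texts are scored by cosine similarity, and the back-end classifier is represented as a second term-weight vector. All function names (`tfidf_vectors`, `centroid`, `score_anchor`, `pick_next`) are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real crawler would need proper text processing.
    return text.lower().split()


def tfidf_vectors(docs):
    # Plain TF-IDF, used here only as a stand-in for the paper's TFIDF-2 model.
    # Assumes every document contains at least one token.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {t: (c / len(toks)) * math.log(1 + n / df[t]) for t, c in tf.items()}
        )
    return vectors


def centroid(vectors, rule="ave"):
    # Assumed reading of the Max / Ave / Sum rules: combine each term's
    # document weights across the root set into one centroid weight.
    combine = {
        "max": max,
        "sum": sum,
        "ave": lambda ws: sum(ws) / len(ws),
    }[rule]
    terms = {t for v in vectors for t in v}
    return {t: combine([v.get(t, 0.0) for v in vectors]) for t in terms}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def score_anchor(anchor_text, classifier_vec):
    # Score an anchor text against a term-weight vector (assumed: cosine similarity).
    return cosine(Counter(tokenize(anchor_text)), classifier_vec)


def pick_next(frontier, front_vec, back_vec):
    # frontier maps URL -> anchor text. The two classifier scores are summed
    # per URL and the URL with the highest combined score is returned.
    return max(
        frontier,
        key=lambda url: score_anchor(frontier[url], front_vec)
        + score_anchor(frontier[url], back_vec),
    )


# Toy usage with made-up root-set documents and URLs.
root_set = ["focused crawler topic crawling", "crawler web crawling search"]
front = centroid(tfidf_vectors(root_set), rule="ave")
back = centroid(tfidf_vectors(["web crawler frontier"]), rule="sum")
frontier = {
    "http://a.example/crawl": "focused crawler",
    "http://b.example/food": "cooking recipes",
}
next_url = pick_next(frontier, front, back)
```

In this toy run the anchor text "cooking recipes" shares no terms with either vector, so the crawler would pick the first URL. How the back-end classifier is actually trained or updated to give incremental crawling is not stated in the abstract and is left out of the sketch.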
