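• Outstanding Chinese Sci-Tech Journal
  • CCF-recommended Class A Chinese-language journal
  • T1-class high-quality sci-tech journal in the computing field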
Wang Hui, Zuo Wanli, Wang Huiyu, Ning Aijun, Sun Zhiwei, Man Chunlei. Centroid-Based Focused Crawler with Incremental Ability[J]. Journal of Computer Research and Development, 2009, 46(2): 217-224.

Centroid-Based Focused Crawler with Incremental Ability

More Information
  • Published Date: February 14, 2009
  • This paper studies how to crawl Web pages selectively. Document feature weights and centroid feature weights are calculated using the proposed TFIDF-2 model and three heuristic rules: Max, Ave, and Sum. Once these two kinds of weights are computed, a centroid vector corresponding to a root set can be constructed easily. The centroid vector then serves as a front-end classifier that guides a focused crawler. First, the authors use the front-end classifier and a back-end classifier separately to score the anchor texts of URLs. Next, they sum the two anchor text scores for each URL. Finally, they select from the frontier the URL with the highest anchor text score and download its corresponding Web page. Four series of experiments are conducted. The results show that, with the aid of the newly constructed centroid vector, the focused crawler can efficiently and accurately predict the relevance of a Web page using only the anchor texts of its URLs. Furthermore, the two-classifier framework gives the focused crawler an incremental crawling ability, one of the most important and interesting features in focused crawling and a problem the field must address.
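The scoring procedure described in the abstract (build a centroid from a root set, score each URL's anchor text with two classifiers, sum the scores, and pick the best URL from the frontier) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not define TFIDF-2, so plain TF-IDF stands in for it; the reading of Max, Ave, and Sum as ways of combining per-document weights into a centroid weight is a guess; and every function name, the cosine scoring, and the placeholder back-end classifier are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real crawler would normalize more carefully.
    return text.lower().split()


def tfidf_vectors(docs):
    # Plain TF-IDF as a stand-in: the paper's TFIDF-2 is not defined in the abstract.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {t: (c / len(toks)) * math.log(1 + n / df[t]) for t, c in tf.items()}
        )
    return vectors


def build_centroid(doc_vectors, rule="ave"):
    # Assumed reading of the heuristic rules: combine each term's document
    # weights over the root set by maximum, average, or sum.
    terms = set().union(*doc_vectors)
    centroid = {}
    for t in terms:
        weights = [v.get(t, 0.0) for v in doc_vectors]
        if rule == "max":
            centroid[t] = max(weights)
        elif rule == "ave":
            centroid[t] = sum(weights) / len(weights)
        elif rule == "sum":
            centroid[t] = sum(weights)
        else:
            raise ValueError(f"unknown rule: {rule}")
    return centroid


def cosine_score(centroid, text):
    # Cosine similarity between the centroid and the anchor text's term counts.
    tf = Counter(tokenize(text))
    dot = sum(centroid.get(t, 0.0) * c for t, c in tf.items())
    norm_c = math.sqrt(sum(w * w for w in centroid.values()))
    norm_t = math.sqrt(sum(c * c for c in tf.values()))
    if norm_c == 0 or norm_t == 0:
        return 0.0
    return dot / (norm_c * norm_t)


def pick_next_url(frontier, front_end, back_end):
    # frontier maps URL -> anchor text; the URL with the highest summed score wins.
    return max(
        frontier,
        key=lambda url: front_end(frontier[url]) + back_end(frontier[url]),
    )


# Usage with a made-up root set and frontier.
root_set = ["focused web crawler", "topical web crawler frontier"]
centroid = build_centroid(tfidf_vectors(root_set), rule="ave")
frontier = {
    "http://example.com/a": "web crawler tutorial",
    "http://example.com/b": "cheap flight tickets",
}
front_end = lambda text: cosine_score(centroid, text)
back_end = lambda text: 0.0  # placeholder: the back-end classifier is not specified
next_url = pick_next_url(frontier, front_end, back_end)
```

Scoring only anchor texts keeps the decision cheap, since no page has to be downloaded before its URL is ranked. In this sketch the back-end classifier is a constant, so the ranking is driven entirely by the centroid.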
