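• China's Outstanding Sci-Tech Journal
  • CCF-recommended Class A Chinese journal
  • T1-class high-quality sci-tech journal in the computing field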
Li Shijun, Yu Junqing, Ou Weijie. Web Information Extraction Based on HTML Pattern Algebra[J]. Journal of Computer Research and Development, 2006, 43(9): 1644-1650.

Web Information Extraction Based on HTML Pattern Algebra

  • Published Date: September 14, 2006
  • Efficiently generating wrappers for extracting Web data has broad application prospects, but the problem has not yet been solved efficiently. To tackle it, a pattern algebra for HTML documents is introduced, including key concepts such as the consistent pattern set and the addition operation on patterns, and a new approach to Web information extraction is built on it. The approach induces a consistent pattern set, which represents the identification rules for each attribute, by examining all of the samples, and then extracts data using that set of multiple patterns. It can be applied to Web pages with tabular structure, including pages with missing attributes, multi-valued attributes, attributes in varying order, and hierarchical structure, and it has been validated experimentally in a prototype.
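
The idea of inducing one pattern set per attribute from labelled samples, and then extracting with every pattern in the set, can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not define the pattern representation or the addition operation, so everything here is an assumption. A pattern is taken to be the pair of tags directly enclosing a value, "addition" is modelled as plain set union, no consistency check is attempted, and the names `induce_pattern`, `add_patterns`, `induce_pattern_sets`, and `extract` are hypothetical. The sketch also assumes every labelled value is directly wrapped in tags.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical pattern representation: the tag just before and just after a value.
Pattern = Tuple[str, str]
TAG = re.compile(r"<[^>]+>")


def induce_pattern(page: str, value: str) -> Pattern:
    """Return the tags that directly enclose the first occurrence of `value`."""
    start = page.index(value)
    end = start + len(value)
    left = TAG.findall(page[:start])[-1]      # nearest tag before the value
    right = TAG.search(page[end:]).group(0)   # nearest tag after the value
    return left, right


def add_patterns(a: Set[Pattern], b: Set[Pattern]) -> Set[Pattern]:
    """Stand-in for pattern addition: simply set union (an assumption)."""
    return a | b


def induce_pattern_sets(
    samples: List[Tuple[str, Dict[str, str]]]
) -> Dict[str, Set[Pattern]]:
    """Accumulate one pattern set per attribute over all labelled samples."""
    pattern_sets: Dict[str, Set[Pattern]] = {}
    for page, labels in samples:
        for attribute, value in labels.items():
            new = {induce_pattern(page, value)}
            pattern_sets[attribute] = add_patterns(
                pattern_sets.get(attribute, set()), new
            )
    return pattern_sets


def extract(page: str, pattern_sets: Dict[str, Set[Pattern]]) -> Dict[str, List[str]]:
    """Apply every pattern of every attribute; unmatched attributes yield []."""
    result: Dict[str, List[str]] = {}
    for attribute, patterns in pattern_sets.items():
        values: List[str] = []
        for left, right in sorted(patterns):
            regex = re.escape(left) + r"(.*?)" + re.escape(right)
            values.extend(re.findall(regex, page, flags=re.S))
        result[attribute] = values
    return result


# Two labelled sample rows; the second one has no price attribute.
samples = [
    ("<tr><td><b>Title A</b></td><td><i>10</i></td></tr>",
     {"title": "Title A", "price": "10"}),
    ("<tr><td><em>Title B</em></td></tr>", {"title": "Title B"}),
]
pattern_sets = induce_pattern_sets(samples)

# A row with a multi-valued price, then a row with the price missing.
print(extract("<tr><td><em>Title C</em></td><td><i>7</i></td><td><i>9</i></td></tr>",
              pattern_sets))
print(extract("<tr><td><b>Title D</b></td></tr>", pattern_sets))
```

Because each attribute keeps several patterns rather than a single rule, a row that marks up its title differently, repeats an attribute, or omits one is still handled without a separate wrapper per layout.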
