    Deep Web Data Extraction Based on Result Pattern

    • Abstract: Acquiring Deep Web data efficiently and accurately is key to building Deep Web data integration systems. However, repeated semantic annotation and the presence of nested attributes are bottlenecks that limit the efficiency and precision of Deep Web data extraction. This paper defines the result pattern and proposes a novel approach to Deep Web data extraction based on it. The approach has two stages: result pattern generation, and data extraction based on the result pattern. Drawing on the features of Deep Web result pages, a feature matrix of Web page data is defined; the result pattern is obtained by constructing and analyzing this matrix. Attribute semantic annotation is completed during the result pattern generation stage, which effectively resolves the problem of repeated semantic annotation. An effective method for dividing nested attributes is also proposed. Experimental results show that, compared with similar approaches, data extraction based on the result pattern improves both efficiency and precision and lays a solid foundation for Deep Web data integration.
