Abstract:
With the rapid growth of the World Wide Web, improving the efficiency and precision of Deep Web data extraction has become increasingly important for effective Deep Web data integration. The main bottlenecks to such improvement are repeated semantic annotation and the existence of nested attributes. This paper defines the notion of a result pattern and proposes a novel approach to Deep Web data extraction based on it. The approach consists of two stages: result pattern generation, and data extraction based on the result pattern. Based on the features of Deep Web result pages, a feature matrix of Web page data is defined; by constructing and analyzing this matrix, the result pattern can be obtained. Attribute semantic annotation is completed during the result pattern generation stage, so that attributes need not be annotated repeatedly. An effective method for splitting nested attributes is also proposed. Experimental results show that Deep Web data extraction based on the result pattern improves both efficiency and precision, laying a solid foundation for Deep Web data integration.