Abstract:
As Web databases proliferate, querying the Deep Web is becoming a primary way to acquire information. The large volume of unstructured content, the heterogeneous results, and the dynamic data in the Deep Web pose new challenges for entity extraction, so extracting entities from Deep Web result pages effectively is an important problem. Based on an analysis of the characteristics of result pages, this paper presents a DOM-tree based entity extraction mechanism for the Deep Web, called D-EEM. D-EEM is modeled in three levels: the expression level, the extraction level, and the collection level. Within this model, the region location and semantic annotation components are the core parts studied in this paper. D-EEM applies a DOM-tree based automatic entity extraction strategy to determine the data regions and the entity regions; the strategy improves extraction accuracy by considering both the textual content and the hierarchical structure of the DOM trees. In addition, a semantic annotation method based on Web context and co-occurrence is proposed to support data integration. An experimental study is conducted to evaluate the feasibility and effectiveness of the key techniques of D-EEM. Compared with several other entity extraction strategies, D-EEM achieves higher extraction accuracy and efficiency.
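To make the idea of locating data regions and entity regions in a DOM tree concrete, the following is a minimal sketch and not the D-EEM algorithm itself. It assumes a simple heuristic chosen only for illustration: the data region is the group of sibling subtrees that share the same tag structure, contain text, and repeat most often, and each such sibling is treated as one entity region. All names (`Node`, `TreeBuilder`, `signature`, `find_entity_regions`, `extract_entities`) are hypothetical, and the input is assumed to be well-formed, properly nested HTML.

```python
from html.parser import HTMLParser


class Node:
    """A minimal DOM-like node: tag name, parent link, children, own text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple tree; assumes every start tag has a matching end tag."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def signature(node):
    """Structural fingerprint of a subtree (tags only, text ignored)."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def leaf_texts(node):
    """All non-empty text fragments in a subtree, in document order."""
    texts = [node.text] if node.text else []
    for child in node.children:
        texts.extend(leaf_texts(child))
    return texts


def walk(node):
    """Pre-order traversal of the tree."""
    yield node
    for child in node.children:
        yield from walk(child)


def find_entity_regions(root):
    """Return the largest group of text-bearing siblings with equal structure."""
    best = []
    for node in walk(root):
        groups = {}
        for child in node.children:
            if leaf_texts(child):  # textual content: skip empty subtrees
                groups.setdefault(signature(child), []).append(child)
        for members in groups.values():  # hierarchical structure: repetition
            if len(members) >= 2 and len(members) > len(best):
                best = members
    return best


def extract_entities(html):
    """Parse a result page and return one list of text fields per entity."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return [leaf_texts(region) for region in find_entity_regions(builder.root)]
```

Calling `extract_entities` on a result page whose records are repeated sibling elements returns one list of text fields per record. Real result pages would need more robust handling (void elements, noisy markup, near-duplicate structures), which this sketch deliberately leaves out.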