Abstract:
Web information extraction is an important task in Web mining. Applications such as the semantic Web, vertical search, and sentiment analysis could benefit from advances in this area. Current techniques require substantial human interaction, which precludes the universal application of Web information extraction. To automate the extraction process, recent research identifies features specific to particular domains and extracts information with machine learning techniques. However, because these methods depend on domain-specific features, they are very difficult to extend to other domains. In this paper, the Web information extraction problem is analyzed and a new subtask, the named attribute extraction task, is proposed. Named attributes are attributes of objects that are encoded in name-value pair form; that is, the name and the value of an attribute are located near each other in the Web page. Therefore, once the attribute names are located, the values can be extracted automatically. Statistics from multiple datasets show that named attributes account for more than 60% of the attributes in the domains covered, which demonstrates the importance of this subtask. An extended domain model is proposed to summarize the attribute names of a domain, and an information extraction method based on this model is developed. Experiments show that the method extracts named attributes with a precision of 80% and a recall higher than 90%.
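The idea that a located attribute name leads directly to its value can be illustrated with a minimal sketch. This is not the method developed in the paper: it assumes the page has already been reduced to lines of text, that names and values are separated by a colon, and that the domain model is a simple mapping from a canonical attribute name to its known surface forms. The names `DOMAIN_MODEL`, `PAIR_RE`, and `extract_named_attributes` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical domain model (an assumption, not the paper's extended model):
# canonical attribute name -> surface forms under which it appears on pages.
DOMAIN_MODEL = {
    "price": {"price", "our price"},
    "author": {"author", "written by"},
    "isbn": {"isbn", "isbn-10"},
}

# Matches "Label: value" on one line of page text; the label holds no colon.
PAIR_RE = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def extract_named_attributes(lines, model=DOMAIN_MODEL):
    """Return {canonical_name: value} for lines whose label is in the model."""
    # Invert the model so each surface form points to its canonical name.
    lookup = {form: name for name, forms in model.items() for form in forms}
    found = {}
    for line in lines:
        match = PAIR_RE.match(line)
        if match is None:
            continue  # not in name-value pair form
        label, value = match.group(1).lower(), match.group(2)
        name = lookup.get(label)
        if name is not None and name not in found:
            found[name] = value  # keep the first value seen for each attribute
    return found


page_lines = [
    "Price: $12.99",
    "  Author :  Brian Kernighan  ",
    "ISBN-10: 0131103628",
    "Shipping: free",
    "Great book about C",
]
attributes = extract_named_attributes(page_lines)
```

In this sketch, lines whose label is not in the model (such as "Shipping") and lines that are not in name-value form are ignored, which is why the coverage of the domain model's attribute names matters for recall.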