Abstract:
This paper proposes KEE, a method for addressing the entity extraction problem over XML data, which is an important step in identifying entities in XML data. Guided by XML keys and using relaxation and verification techniques, KEE provides a rule-based solution to the entity extraction problem with the following characteristics. First, by using an XML query language, KEE provides a condensed representation of the entities, whose size may grow very large as the data scales up. Second, KEE requires only one example location to indicate the user's interest and uses relaxation to discover other similar locations automatically. Third, by adjusting the example given to KEE, users can specify the entity locations they are interested in and control which locations KEE discovers. In addition, an efficient implementation of KEE is provided, which shares computation by extending previous automaton techniques for XML query evaluation. Experimental results on both synthetic and real data show that KEE provides an effective and efficient solution to the entity extraction problem.