Abstract:
With the widespread use of XML data stream, discovering knowledge from it becomes important. Compared with other frequent pattern mining, mining frequent subtree over large-scale XML documents and unlimited growing XML data stream is facing difficulties: data steam can not be resolved in memory as a whole, and mining partitioned XML data stream must be considered semi-structured characteristics of XML data, etc. Inspired by this fact, Tmlist is proposed for mining frequent subtrees over paging XML data stream. Tmlist pages XML data stream, manages cross-page nodes and frequent candidate subtrees growing across page, and mines frequent subtrees page-by-page. Frequent candidate subtrees grow by inserting frequent candidate nodes in their rightmost path according to the level of their roots, avoiding the repeated recursive growth of the subtrees rooted by the low-level nodes. A subtree is represented by the topologic sequence of its rightmost path, which avoids the prefix match for the increment of subtrees, so the storing and matching cost for the prefix nodes is cut. Frequent candidate subtrees are selected according to the page minimum support, the support of frequent subtrees is decayed and branches are pruned according to the decaying factor. Accordingly, Tmlist reduces the memory cost of mining frequent subtrees in the limit of error and improves memory utilization and mining efficiency.