基于扩展自然序树的概化关联规则增量挖掘方法

毛宇星  施伯乐

基于扩展自然序树的概化关联规则增量挖掘方法

毛宇星施伯乐

An Incremental Method for Mining Generalized Association Rules Based on Extended Canonical-Order Tree

Mao Yuxing and Shi Baile

摘要

摘要: 概化关联规则挖掘作为数据挖掘领域一个重要的拓展性研究课题，首先提出了一种概化扩展自然序树(generalized extended canonical-order tree, GECT)结构及其增量挖掘算法GECT-IM.该算法对原始分类事务数据库只扫描一次，就可以将所有交易信息映射至一棵压缩格式的GECT，然后通过对更新交易数据集扫描得到更新数据集中各项集的计数，结合相关性质及运算就可以发现大部分更新后的概化频繁项集；其次，针对GECT规模较大以及GECT-IM算法仍然可能需要遍历初始GECT树的局限，在界定数据库更新和重构概念的基础上，基于一种可量化度量的准最小支持度阈值，提出了一种改进的准频繁概化扩展自然序树(pre-large generalized extended canonical-order tree, PGECT)结构及其增量挖掘算法PGECT-IM.由于有效避免了对初始GECT进行遍历的情形，从而进一步提升了概化关联规则增量挖掘效率.实验证明，提出的概化关联规则增量挖掘算法GECT-IM及其优化算法PGECT-IM，比现有增量挖掘算法具有更高的挖掘效率和更好的扩展性.

Abstract: Mining generalized association rules is one of the important research areas in data mining. In real life applications, the transaction database is updated frequently. It makes the maintenance of generalized association rules one of challenging research work. In this paper, firstly, by analyzing and summarizing all the update cases of the taxonomy data, several special properties of updating are concluded; secondly, we project the transaction databases to a new compact structure called GECT (generalized extended canonical-order tree) composed of two header tables that can be used to mine the whole updated tree and the incremental tree. Thirdly, we propose an incremental updating algorithm GECT-IM, which finds most updated frequent itemsets by scanning the updating transactions set instead of the original database; To tackle the limit of GECT-IM, which still need scan the GECT when the infrequent itemsets become frequent, we propose a further optimized structure called PGECT (pre-large generalized extended canonical-order tree) and an efficient algorithm PGECT-IM. Within the certain updating scope, it can find all the updated frequent itemsets without rescanning the original PGECT. The experiments on synthetic datasets show that our algorithms, both GECT-IM and PGECT-IM, are not only correct and complete but also outperform the well-known and current algorithms.

HTML全文

参考文献(0)

施引文献

资源附件(0)