CBC-DS: 基于频繁闭模式的数据流分类算法
CBC-DS: A Classification Algorithm Based on Closed Frequent Patterns for Mining Data Streams
-
摘要: 基于关联规则的分类算法通常根据频繁模式生成类关联规则,但频繁模式挖掘易遭受组合爆炸问题,影响算法效率.并且数据流的出现也对分类算法提出了新的挑战.相对于频繁模式,频繁闭模式的数目较少,挖掘频繁闭模式的算法通常具有较高的效率.为此,提出了一种高效的基于频繁闭模式的数据流分类算法—CBC-DS.主要贡献在于:1)提出了一种基于逆文法顺序FP-Tree的频繁闭项集单遍挖掘过程,用于挖掘类关联规则,该过程采用了一种混合项顺序搜索策略以满足数据流挖掘的单遍性需求,并采用位图技术提高效率;2)提出了“自支持度”概念,用于筛选规则以提高算法分类精度.实验表明,位图技术能够提高算法速度2倍以上,利用自支持度能够提高算法平均精度0.5%左右;最终CBC-DS算法的平均分类精度比经典算法CMAR高1%左右,并且CBC-DS算法的规则挖掘速度远快于CMAR算法.Abstract: The classification algorithms based on association rules generally generate classification association rules by frequent patterns. As mining frequent patterns often suffer from the problem of combinatorial explosion, the efficiency of the algorithms is low. Moreover, the emergence of data streams has posed new challenges for classification algorithms. In contrast to frequent patterns, the number of closed frequent patterns is less, so that the efficiency of algorithms for mining closed frequent patterns is higher. A novel and efficient closed-frequent-patterns based classification algorithm, CBC-DS, is proposed for classifying data streams. The contributions are listed as follows: (1) a single-pass closed frequent itemsets mining process based on reverse lexicographic order FP-tree is introduced for mining classification association rules, which uses a kind of mixed item-ordering searching policy to satisfy the single-pass requirement of data streams and uses the bitmap technology to improve the efficiency; (2) the concept of self-support for filtering rules is proposed to improve the precision. The experimental results show that the bitmap technology can improve the efficiency of the algorithm about twice at least and the average classifying precision can be improved about 0.5% by using self-support. Eventually, the average precision of CBC-DS is about 1% higher than that of CMAR, and CBC-DS is much faster than CMAR.