一种改进的频繁闭项集挖掘算法

宋  威; 杨炳儒; 徐章艳; 高  静

一种改进的频繁闭项集挖掘算法

An Improved Algorithm for Mining Frequent Closed Itemsets

摘要

摘要: 频繁闭项集惟一确定频繁项集且规模小得多,但挖掘频繁闭项集仍是很费时的.为提高挖掘效率,提出了一种改进的频繁闭项集挖掘算法DCI-Closed-Index. 该算法用“索引数组”来组织数据,通过为每个项目增加包含索引,找到频繁共同出现的项集.利用二进制位图技术,给出了一个求包含索引的快速算法.然后根据项目在包含索引中出现的频率由高到低进行排序,并利用包含索引作为启发信息,合并同时出现且支持度相等的频繁项,得到初始生成子,从而大大缩小了搜索空间.同时利用索引数组对每一个生成子的前序集和后序集进行约简,得到新的、较小的约简前序集和约简后序集.并证明了约简前序集和后序集与原来的前序集和后序集的功能是一样的.从而减少了候选生成子的集合包含判断的操作.实验结果表明,该算法的性能优于其他主流算法.

Abstract: The set of frequent closed itemsets determines exactly the complete set of all frequent itemsets and is usually much smaller than the latter. Yet mining frequent closed itemsets remains to be a time consuming task. Proposed in this paper is an improved algorithm DCI-closed-index for mining frequent closed itemset. Firstly, the “index array” is proposed. Using the subsume index, those itemsets that always appear together can be discovered. Then, by using bitmap, an algorithm for computing index array is presented. Thirdly, the items are sorted in frequency descending order according to their frequencies in subsume index. Fourthly, frequent items, which appear together and share the same supports, are merged to initial generators according to heuristic information provided by index array. Thus, the search space is reduced greatly. Finally, based on index array, reduced pre-set and reduced post-set are proposed. It is proved that the reduced pre-set and post-set are equivalent to original pre-set and post-set. Thus, the redundant set-inclusion operations are avoided greatly. The experimental results show that the proposed algorithm outperforms other state-of-the-art algorithms.

HTML全文

参考文献(0)

施引文献

资源附件(0)