ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2015, Vol. 52 ›› Issue (3): 569-578.doi: 10.7544/issn1000-1239.2015.20131436

• 人工智能 • 上一篇    下一篇

基于磁盘存储1项集计数的增量FP_GROWTH算法

申彦1,2,朱玉全2,刘春华2   

  1. 1(江苏大学信息管理与信息系统系 江苏镇江 212013); 2(江苏大学计算机科学与通信工程学院 江苏镇江 212013) (104186179@qq.com)
  • 出版日期: 2015-03-01
  • 基金资助: 
    基金项目:国家自然科学基金项目(70971067);国家科技支撑计划基金项目(2010BAI88B00);全国统计科学研究计划项目(2012LY160);江苏省自然科学基础研究计划基金项目(BK2010331);江苏省博士研究生创新计划基金项目(CX10B_016X);江苏大学高级人才基金项目(13JDG127);江苏省博士后科研资助计划项目(1401056C)

Incremental FP_GROWTH Algorithm Based on Disk-resident 1-itemsets Counting

Shen Yan1,2, Zhu Yuquan2, Liu Chunhua2   

  1. 1(Department of Information Management and Information System, Jiangsu University, Zhenjiang, Jiangsu 212013); 2(School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013)
  • Online: 2015-03-01

摘要: 随着数据集规模的不断增大,提高频繁项集的挖掘效率成为数据挖掘领域的研究重点.频繁项集的增量更新挖掘算法因其可以利用已挖掘发现的信息提高对新数据集的挖掘效率,成为重要的研究方向.但现有频繁项集增量更新算法大多基于APRIORI算法框架,性能提高有限.最近出现的建立在FP-TREE等树形结构上的增量更新算法又往往存在树形结构调整困难、已发现频繁项集及树形结构保存效率较低等问题,算法性能有待进一步地提高.对此,通过分析增量挖掘过程中的关键信息,提出了一种基于磁盘存储1项集计数的增量FP_GROWTH算法(IU_FPGROWTH_1COUNTING).该算法无需保存临时树形结构及临时挖掘结果,可以在原数据集及支持度均发生变化时,减少FP_GROWTH算法对数据集的扫描,提高频繁项集的挖掘效率.在生成以及真实数据集上进行了验证实验以及性能分析,结果表明IU_FPGROWTH_1COUNTING是一种有效的频繁项集增量更新挖掘算法.

关键词: 关联规则, 频繁项集, 大规模数据, FP_GROWTH算法, 增量更新

Abstract: As the size of data set increases constantly, improving the mining efficiency of frequent itemsets has become a key research in data mining fields. The incremental update algorithm of frequent itemsets is a main research target, because these algorithms can make use of the discovered information to improve mining efficiency for new data set. However, current incremental update algorithms are almost based on the framework of APRIORI. These algorithms only have limited improvements of efficiency. There are some new incremental update algorithms recently based on FP-TREE and the like data structure. But these algorithms have some other problems such as low efficiency in tree structure adjustment, low efficiency in storage of the tree, discovered frequent itemsets and so on. The performance of algorithms should be improved further. Therefore, a novel incremental FP_GROWTH algorithm based on disk-resident 1-itemsets counting (IU_FPGROWTH_1COUNTING) is presented in this paper after analyzing the key information of incremental update mining process. This algorithm doesnt need to store the temporal tree structure and temporal mining results. When the original data set and the support change at the same time, IU_FPGROWTH_1COUNTING can improve mining efficiency of frequent itemsets by decreasing the data set scan of FP_GROWTH, which is very time-consuming. The verification experiments and performance analysis on synthetic data set and real data set are presented. The experimental results show that IU_FPGROWTH_1COUNTING is an effective incremental update algorithm of frequent itemsets.

Key words: association rules, frequent itemsets, massive data, FP_GROWTH algorithm, incremental update

中图分类号: