ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2015, Vol. 52 ›› Issue (2): 445-455.doi: 10.7544/issn1000-1239.2015.20131184

• 人工智能 • 上一篇    下一篇

不产生候选项集的TOP-K高效用模式挖掘算法

王乐1,2,冯林2,王水1   

  1. 1(宁波大红鹰学院信息工程学院 浙江宁波 315175); 2(大连理工大学电子信息与电气工程学部计算机科学与技术学院 辽宁大连 116024) (wangleboro@163.com)
  • 出版日期: 2015-02-01
  • 基金资助: 
    基金项目:国家自然科学基金项目(61370200);宁波市自然科学基金项目(2013A610115,2014A610073);浙江省教育厅一般科研项目(Y201432717)

An Algorithm of Mining TOP-K High Utility Patterns Without Generating Candidates

Wang Le1,2, Feng Lin2, Wang Shui1   

  1. 1(School of Information Engineering, Ningbo Dahongying University, Ningbo, Zhejiang 315175); 2(School of Computer Science and Technology, Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning 116024)
  • Online: 2015-02-01

摘要: 目前TOP-K高效用模式挖掘算法需要产生候选项集,特别是当数据集比较大或者数据集中包含较多长事务项集时,算法的时间和空间效率会受到更大的影响.针对此问题,通过将事务项集和项集效用信息有效地保存到树结构HUP-Tree,给出一个不需要候选项集的挖掘算法TOPKHUP;HUP-Tree树能保证从中计算到每个模式的效用值,不需要再扫描数据集来计算模式的效用值,从而使挖掘算法的时空效率得到较大的提高.采用7个典型数据集对算法的性能进行测试,实验结果证明TOPKHUP的时间和空间效率都优于已有算法,并对K值的变化保持平稳.

关键词: 高效用模式, 频繁模式, 频繁项集, 数据挖掘, TOP-K

Abstract: Mining TOP-K high utility pattern from a dataset is an extension of frequent pattern mining, and it aims to mine the patterns whose utilities are higher than a user-specified minimum utility threshold. At present, it has been a topic in data mining. Existing algorithms of mining TOP-K high utility pattern generate candidate itemsets in the mining process and they need multiple scans of a dataset; this hinders their performance of runtime and memory usage, especially when a dataset is large or there are many long transaction itemsets in a dataset. To address this issue, we propose a tree structure called HUP-Tree (high utility pattern tree) to maintain transaction itemsets and their utility values, and we also give an algorithm named TOPKHUP (TOP-K high utility pattern) that mines TOP-K high utility patterns without generating candidates. HUP-Tree ensures efficient retrieval of utility value of each pattern without additional scan of the dataset, so the performance of the algorithm is effectively improved. Seven classical real and synthetic datasets are used in the testing experiments and the results show that the proposed algorithm outperforms state-of-the-art algorithms significantly for both runtime performance and memory usage, and it is more stable along the change of the value K.

Key words: high utility pattern, frequent pattern, frequent itemsets, data mining, TOP-K

中图分类号: