高级检索
    王乐, 冯林, 王水. 不产生候选项集的TOP-K高效用模式挖掘算法[J]. 计算机研究与发展, 2015, 52(2): 445-455. DOI: 10.7544/issn1000-1239.2015.20131184
    引用本文: 王乐, 冯林, 王水. 不产生候选项集的TOP-K高效用模式挖掘算法[J]. 计算机研究与发展, 2015, 52(2): 445-455. DOI: 10.7544/issn1000-1239.2015.20131184
    Wang Le, Feng Lin, Wang Shui. An Algorithm of Mining TOP-K High Utility Patterns Without Generating Candidates[J]. Journal of Computer Research and Development, 2015, 52(2): 445-455. DOI: 10.7544/issn1000-1239.2015.20131184
    Citation: Wang Le, Feng Lin, Wang Shui. An Algorithm of Mining TOP-K High Utility Patterns Without Generating Candidates[J]. Journal of Computer Research and Development, 2015, 52(2): 445-455. DOI: 10.7544/issn1000-1239.2015.20131184

    不产生候选项集的TOP-K高效用模式挖掘算法

    An Algorithm of Mining TOP-K High Utility Patterns Without Generating Candidates

    • 摘要: 目前TOP-K高效用模式挖掘算法需要产生候选项集,特别是当数据集比较大或者数据集中包含较多长事务项集时,算法的时间和空间效率会受到更大的影响.针对此问题,通过将事务项集和项集效用信息有效地保存到树结构HUP-Tree,给出一个不需要候选项集的挖掘算法TOPKHUP;HUP-Tree树能保证从中计算到每个模式的效用值,不需要再扫描数据集来计算模式的效用值,从而使挖掘算法的时空效率得到较大的提高.采用7个典型数据集对算法的性能进行测试,实验结果证明TOPKHUP的时间和空间效率都优于已有算法,并对K值的变化保持平稳.

       

      Abstract: Mining TOP-K high utility pattern from a dataset is an extension of frequent pattern mining, and it aims to mine the patterns whose utilities are higher than a user-specified minimum utility threshold. At present, it has been a topic in data mining. Existing algorithms of mining TOP-K high utility pattern generate candidate itemsets in the mining process and they need multiple scans of a dataset; this hinders their performance of runtime and memory usage, especially when a dataset is large or there are many long transaction itemsets in a dataset. To address this issue, we propose a tree structure called HUP-Tree (high utility pattern tree) to maintain transaction itemsets and their utility values, and we also give an algorithm named TOPKHUP (TOP-K high utility pattern) that mines TOP-K high utility patterns without generating candidates. HUP-Tree ensures efficient retrieval of utility value of each pattern without additional scan of the dataset, so the performance of the algorithm is effectively improved. Seven classical real and synthetic datasets are used in the testing experiments and the results show that the proposed algorithm outperforms state-of-the-art algorithms significantly for both runtime performance and memory usage, and it is more stable along the change of the value K.

       

    /

    返回文章
    返回