An Efficient Algorithm for Mining Compressed Sequential Patterns

Tong Yongxin (童咏昕); Zhang Yuanyuan (张媛媛); Yuan Mei (袁玫); Ma Shilong (马世龙); Yu Dan (余丹); Zhao Li (赵莉)


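1(State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191) 2(China Academy of Telecommunication Technology, Beijing 100191) 3(College of Information, Beijing Union University, Beijing 100084) (yxtong@nlsde.buaa.edu.cn)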

Publication history
- Published: 2010-01-14

An Efficient Algorithm for Mining Compressed Sequential Patterns

1(State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191) 2(China Academy of Telecommunication Technology, Beijing 100191) 3(College of Information, Beijing Union University, Beijing 100084)

Abstract

Mining frequent sequential patterns from sequence databases is a central research topic in data mining, and various efficient algorithms for mining sequential patterns have been proposed and studied. Because the mining process generates a huge number of frequent sequential patterns, many researchers have recently shifted their attention from the efficiency of sequential pattern mining algorithms to making the resulting set of patterns easier for users to understand. This paper studies the problem of compressing frequent sequential patterns. Inspired by the ideas of compressing frequent itemsets, it proposes an algorithm, CFSP (compressing frequent sequential patterns), which mines a small number of representative sequential patterns that express the information of all frequent sequential patterns while eliminating a large number of redundant ones. CFSP is a two-step algorithm: the first step obtains all closed sequential patterns as the candidate set of representative sequential patterns and, at the same time, finds most of the representative patterns; the second step spends only a little time finding the remaining representative sequential patterns. An empirical study on both real and synthetic data sets shows that CFSP is efficient.

Keywords:
- sequential pattern mining
- compression
- frequent pattern mining
- association rules
- data mining
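
The abstract gives neither definitions nor pseudocode for CFSP, so the following is only a toy sketch of the general two-step idea (closed patterns as candidates, then a small representative set), not the authors' algorithm. Several things here are assumptions made for illustration: sequences are restricted to single items per position, frequent patterns are enumerated by brute force, a pattern is taken to "cover" a sub-pattern when the Jaccard distance between their supporting sequence sets is at most a threshold `delta`, and representatives are chosen greedily. All function names are hypothetical.

```python
from itertools import chain


def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def supporting_ids(pattern, db):
    """Indices of the database sequences that contain pattern."""
    return frozenset(i for i, seq in enumerate(db) if is_subsequence(pattern, seq))


def frequent_patterns(db, min_sup):
    """Brute-force level-wise growth (assumes min_sup >= 1).

    Returns a dict mapping each frequent pattern to its supporting ids.
    """
    items = sorted(set(chain.from_iterable(db)))
    result, frontier = {}, [()]
    while frontier:
        next_frontier = []
        for prefix in frontier:
            for item in items:
                candidate = prefix + (item,)
                ids = supporting_ids(candidate, db)
                if len(ids) >= min_sup:
                    result[candidate] = ids
                    next_frontier.append(candidate)
        frontier = next_frontier
    return result


def closed_patterns(freq):
    """Step 1 (assumed form): keep patterns with no longer super-sequence of equal support."""
    return {
        p: ids
        for p, ids in freq.items()
        if not any(
            len(q) > len(p) and len(freq[q]) == len(ids) and is_subsequence(p, q)
            for q in freq
        )
    }


def jaccard_distance(a, b):
    """Distance between two non-empty sets of supporting sequence ids."""
    return 1 - len(a & b) / len(a | b)


def representatives(closed, delta):
    """Step 2 (assumed greedy form): longest patterns first; each pick
    covers its sub-patterns whose support sets lie within delta."""
    covered, reps = set(), []
    for p in sorted(closed, key=lambda q: (-len(q), q)):
        if p in covered:
            continue
        reps.append(p)
        for q in closed:
            if is_subsequence(q, p) and jaccard_distance(closed[p], closed[q]) <= delta:
                covered.add(q)
    return reps


# Toy database of four sequences; a pattern is frequent if at least 2 contain it.
DB = [tuple("abc"), tuple("abc"), tuple("abd"), tuple("ab")]
freq = frequent_patterns(DB, min_sup=2)
closed = closed_patterns(freq)
print(len(freq), "frequent patterns,", len(closed), "closed patterns")
print("representatives at delta=0.5:", representatives(closed, 0.5))
```

On this toy database the closed patterns are already far fewer than the frequent ones, and a looser `delta` lets one long pattern stand in for its shorter sub-patterns. A real implementation would replace the brute-force enumeration with a dedicated closed sequential pattern miner.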