无重复投影数据库扫描的序列模式挖掘算法

张  坤      朱扬勇

无重复投影数据库扫描的序列模式挖掘算法

张坤朱扬勇

Sequence Pattern Mining Without Duplicate Project Database Scan

Zhang Kun and Zhu Yangyong

摘要

摘要: 序列模式挖掘在Web点击流分析、自然灾害预测、DNA和蛋白质序列模式发现等领域有着广泛应用.基于频繁模式增长的PrefixSpan是目前性能最好的序列模式挖掘算法之一.然而在密数据集和长序列模式挖掘过程中会出现大量的重复投影数据库，使得这类算法性能下降.算法SPMDS通过对投影数据库的伪投影做单项杂凑函数，如MD5等，检查是否存在重复的投影数据库，避免大量重复数据库的扫描，并采用一些必要条件简化投影数据库的搜索，进而提高算法的性能.实验和分析都表明SPMDS性能优于PrefixSpan.

Abstract: Sequence pattern mining has broad applications in the analysis of Web click streams, the prediction of disasters and the pattern discovery of DNA and protein sequences. PrefixSpan, which is based on frequent pattern growth approach, is currently one of the fastest algorithms towards this target. However, PrefixSpan will produce huge amount of duplicated project databases in mining dense data sets and long sequence patterns. In order to overcome this drawback, a random algorithm named SPMDS is proposed. The algorithm avoids scanning duplicated project databases by checking evidences computed by exercising one way hash function such as MD5 to pseudo projections of project databases, and also improves its performance by simplifying the search in the project tree using some necessary conditions. Both experiments and analyses show that SPMDS is better than PrefixSpan.

HTML全文

参考文献(0)

施引文献

资源附件(0)