Abstract:
Sequential pattern mining has broad applications, including the analysis of Web click streams, disaster prediction, and pattern discovery in DNA and protein sequences. PrefixSpan, which is based on the frequent pattern growth approach, is currently one of the fastest algorithms for this task. However, PrefixSpan produces a large number of duplicated projected databases when mining dense data sets and long sequential patterns. To overcome this drawback, a randomized algorithm named SPMDS is proposed. The algorithm avoids scanning duplicated projected databases by checking digests computed by applying a one-way hash function, such as MD5, to the pseudo projections of the projected databases. It further improves performance by simplifying the search in the projection tree using several necessary conditions. Both experiments and analyses show that SPMDS outperforms PrefixSpan.
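
The duplicate check described above can be illustrated with a minimal sketch. It is not the SPMDS implementation: it assumes sequences of single items (not itemsets), represents a pseudo projection as a list of (sequence id, offset) pairs, and omits the necessary-condition checks on the search tree. The names `projection_digest`, `mine`, `cache`, and `stats` are hypothetical and chosen only for illustration.

```python
import hashlib
from collections import defaultdict


def projection_digest(projection):
    """MD5 digest of a pseudo projection (list of (sequence id, offset) pairs)."""
    data = ";".join(f"{sid}:{pos}" for sid, pos in sorted(projection))
    return hashlib.md5(data.encode("utf-8")).hexdigest()


def mine(db, min_support):
    """Prefix-growth mining with digest-based reuse of duplicated projections.

    Assumption: each sequence in db is a string or list of single items.
    Returns ({pattern tuple: support}, stats).
    """
    cache = {}  # digest -> suffix patterns already mined from that projection
    stats = {"scans": 0, "reused": 0}

    def grow(projection):
        key = projection_digest(projection)
        if key in cache:
            # Same digest: treat the projected database as already mined.
            stats["reused"] += 1
            return cache[key]
        stats["scans"] += 1
        # Scan each suffix once, recording the first occurrence of every item.
        extensions = defaultdict(list)
        for sid, pos in projection:
            seen = set()
            for i in range(pos, len(db[sid])):
                item = db[sid][i]
                if item not in seen:
                    seen.add(item)
                    extensions[item].append((sid, i + 1))
        suffixes = {}
        for item, child in extensions.items():
            if len(child) >= min_support:
                suffixes[(item,)] = len(child)
                for tail, support in grow(child).items():
                    suffixes[(item,) + tail] = support
        cache[key] = suffixes
        return suffixes

    patterns = grow([(sid, 0) for sid in range(len(db))])
    return patterns, stats


if __name__ == "__main__":
    patterns, stats = mine(["abc", "abc"], min_support=2)
    print(len(patterns), stats)
```

On the toy database `["abc", "abc"]` with a minimum support of 2, the prefixes `a,c`, `b,c`, and `c` all lead to the same pseudo projection, so only one of them triggers a scan and the others reuse the cached result. Because a digest match is treated as equality of projections, the sketch is correct only up to hash collisions.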