Abstract:
Sequence data is an important data type, ubiquitous in many domains such as text, Web access log and biological database. Similarity query in this kind of data is a very important means for extracting useful information. One key factor for high performance of similarity query in huge sequence database is the filtering level of query algorithm,namely, designing those filters that can quickly filter out the unpromising strings for query string is very important. Combining metric property of sequences distance with sequences characteristics, an algorithm called SSQ_MF based on multiple filters is proposed, whose filtering level is further improved compared with those query algorithms with a single filter. The filters used in SSQ_MF are length filter,prefix filter, and reference-based filter. Then, the statistical information about the string database is pre-computed and some data structures are used to store those information. Furthermore, every filters filtering size is effectively estimated and a model in which optimal filtering order is determined by every filters filtering size is built. Comprehensive experimental results demonstrate that in terms of query performance, SSQ_MF is better than algorithms with a single filter and algorithms with multiple filters but executing in a random order.