一种优化多重过滤的序列查询算法

戴东波  汤春蕾  邱伯仁  熊赟  朱扬勇

一种优化多重过滤的序列查询算法

戴东波汤春蕾邱伯仁熊赟朱扬勇

An Algorithm for Sequence Similarity Query with Optimized Multiple Filtering

Dai Dongbo, Tang Chunlei, Qiu Boren, Xiong Yun, and Zhu Yangyong

摘要

摘要: 序列数据一类重要的数据类型，在文本、Web访问日志文件、生物数据库等应用中普遍存在，对其进行相似性查询是一种获取有用信息的重要手段.在大型序列数据库中进行高效相似性查询的关键因素之一就是查询算法的过滤能力，即设计能快速过滤与查询序列不相关序列集的过滤器十分重要.提出了结合序列距离的度量性质和序列自身特征的多重过滤算法SSQ_MF，SSQ_MF使用了长度过滤器、前缀过滤器和基于参考集的过滤器，使得算法过滤能力较基于单一过滤器算法进一步增强.此外，设计了有关数据结构对查询数据库的一些统计信息进行了预计算和保存，有效估计了各过滤器的过滤集大小，并构建了一个由过滤集大小确定的最优过滤顺序模型，使得算法的过滤代价最低.实验结果表明，算法SSQ_MF的查询性能优于单一过滤器算法和随机过滤顺序的多过滤器算法.

Abstract: Sequence data is an important data type, ubiquitous in many domains such as text, Web access log and biological database. Similarity query in this kind of data is a very important means for extracting useful information. One key factor for high performance of similarity query in huge sequence database is the filtering level of query algorithm，namely, designing those filters that can quickly filter out the unpromising strings for query string is very important. Combining metric property of sequences distance with sequences characteristics, an algorithm called SSQ_MF based on multiple filters is proposed, whose filtering level is further improved compared with those query algorithms with a single filter. The filters used in SSQ_MF are length filter,prefix filter, and reference-based filter. Then, the statistical information about the string database is pre-computed and some data structures are used to store those information. Furthermore, every filters filtering size is effectively estimated and a model in which optimal filtering order is determined by every filters filtering size is built. Comprehensive experimental results demonstrate that in terms of query performance, SSQ_MF is better than algorithms with a single filter and algorithms with multiple filters but executing in a random order.

HTML全文

参考文献(0)

施引文献

资源附件(0)