Abstract:
Sequence data is ubiquitous in many domains, such as text, Web access logs, and biological databases. Similarity query over sequence data is an important means of extracting useful information. Recently, with the development of scientific computing and the generation of large-scale sequence data, similarity query on sequence data has become a hot research topic. Important issues include: the similarity metrics used in different application fields and the connections between them; statistical information on the distance distribution over random sequence collections and its role in analyzing the performance of query algorithms; and the key techniques for efficiently answering similarity queries on large-scale datasets, together with a comparison of their merits and demerits. In this survey, the classification and characteristics of sequence data are summarized. Several similarity metrics and statistical information about the distance between random sequences are presented, and the relationships among these similarity metrics are further analyzed. Then, the types of similarity query and the key issues involved are introduced. Based on these foundations, this paper focuses on the classification and evaluation of key techniques for sequence similarity search. Finally, some challenges in similarity query of sequence data are discussed and future research trends are summarized.