ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2015, Vol. 52 ›› Issue (3): 738-748.doi: 10.7544/issn1000-1239.2015.20130683

• 软件技术 • 上一篇    下一篇

外存中高效的字符串相似性查询处理

王金宝1,高宏2,李建中2,杨东华1   

  1. 1(哈尔滨工业大学基础与交叉科学研究院高性能计算中心 哈尔滨 150001); 2(哈尔滨工业大学计算机科学与技术学院 哈尔滨 150001) (wangjinbaosky@gmail.com)
  • 出版日期: 2015-03-01
  • 基金资助: 
    基金项目:国家“九七三”重点基础研究发展计划基金项目(2006CB303005);国家自然科学基金项目(60903016,60533110,60773063,61272046);教育部新世纪优秀人才支持计划基金项目(NCET-05-0333);黑龙江省教育厅科学技术研究项目(11531276);中国博士后科学基金第六批特别资助项目(2013T60372);黑龙江省自然科学基金项目(F201317);中央高校基本科研业务费专项资金项目(HIT.NSRIF.2015065)

Processing String Similarity Search in External Memory Efficiently

Wang Jinbao1, Gao Hong2, Li Jianzhong2, Yang Donghua1   

  1. 1(Center for High Performance Computing, Academy of Fundamental and Interdisciplinary Sciences, Harbin Institute of Technology, Harbin 150001); 2(School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001)
  • Online: 2015-03-01

摘要: 字符串相似性查询是众多应用的基础操作,如数据清洁、拼写校验、生物信息学和信息集成等.随着数据的爆炸性增长,大规模字符串数据日益普遍,现代的信息系统中也广泛使用字符串作为数据的表达形式.现有支持字符串相似性查询的方法大多是基于q-gram的内存倒排索引,在处理大规模字符串集合会消耗无法忍受的内存容量,甚至在数据量过大时造成内存容量不足而无法支持查询处理.现有的外存倒排索引Behm-Index在查询的过滤阶段只支持少数过滤器,不能有效地减少查询I/O代价.提出了LPA-Index:一种支持长度过滤器和位置过滤器的外存倒排索引,并通过选择查询时使用的倒排表来有效地降低查询I/O代价.实验结果表明,与现有性能最好的外存索引Behm-Index相比,LPA-Index能够大幅降低查询的I/O代价,获得了更短的查询响应时间.

关键词: 字符串, 相似性查询, 外存, 查询处理, 编辑距离

Abstract: String similarity search is fundamental to various applications, such as data cleaning, spell checking, bioinformatics and information integration, since users tend to provide fuzzy queries in these applications due to input errors of both queried strings and data strings. With the rapid growth of data size, big string datasets become ubiquitous, and almost every modern information system stores data presented in string format. String similarity search should be well supported in modern information systems. Memory based q-gram inverted indexes fail to support string similarity search over big string datasets due to the memory constraint, and it can no longer work if the data size grows larger than memory size. Existing external memory method, Behm-Index, only supports length-filter and prefix filter. This paper proposes LPA-Index to reduce I/O cost for better query response time. It is a disk resident index which suffers no limitation on data size compared to memory size. LPA-Index supports both length-filter and positional filter which reduce query candidates efficiently, and it selectively reads inverted lists during query processing for better I/O performance. Experiment results on both real datasets and a synthetic dataset demonstrate the efficiency of LPA-Index and its advantages over existing state of art disk index Behm-Index with regard to I/O cost and query response time.

Key words: string, similarity search, external memory, query processing, edit distance

中图分类号: