• 中国精品科技期刊
  • CCF推荐A类中文期刊
  • 计算领域高质量科技期刊T1类
Advanced Search
Wang Jinbao, Gao Hong, Li Jianzhong, Yang Donghua. Processing String Similarity Search in External Memory Efficiently[J]. Journal of Computer Research and Development, 2015, 52(3): 738-748. DOI: 10.7544/issn1000-1239.2015.20130683
Citation: Wang Jinbao, Gao Hong, Li Jianzhong, Yang Donghua. Processing String Similarity Search in External Memory Efficiently[J]. Journal of Computer Research and Development, 2015, 52(3): 738-748. DOI: 10.7544/issn1000-1239.2015.20130683

Processing String Similarity Search in External Memory Efficiently

More Information
  • Published Date: February 28, 2015
  • String similarity search is fundamental to various applications, such as data cleaning, spell checking, bioinformatics and information integration, since users tend to provide fuzzy queries in these applications due to input errors of both queried strings and data strings. With the rapid growth of data size, big string datasets become ubiquitous, and almost every modern information system stores data presented in string format. String similarity search should be well supported in modern information systems. Memory based q-gram inverted indexes fail to support string similarity search over big string datasets due to the memory constraint, and it can no longer work if the data size grows larger than memory size. Existing external memory method, Behm-Index, only supports length-filter and prefix filter. This paper proposes LPA-Index to reduce I/O cost for better query response time. It is a disk resident index which suffers no limitation on data size compared to memory size. LPA-Index supports both length-filter and positional filter which reduce query candidates efficiently, and it selectively reads inverted lists during query processing for better I/O performance. Experiment results on both real datasets and a synthetic dataset demonstrate the efficiency of LPA-Index and its advantages over existing state of art disk index Behm-Index with regard to I/O cost and query response time.
  • Related Articles

    [1]Wang Mengru, Yao Yunzhi, Xi Zekun, Zhang Jintian, Wang Peng, Xu Ziwen, Zhang Ningyu. Safety Analysis of Large Model Content Generation Based on Knowledge Editing[J]. Journal of Computer Research and Development, 2024, 61(5): 1143-1155. DOI: 10.7544/issn1000-1239.202330965
    [2]Shi Jiaoli, Huang Chuanhe, He Kai, Shen Xieyang, Hua Chao. An Access Control Method Supporting Multi-User Collaborative Edit in Cloud Storage[J]. Journal of Computer Research and Development, 2017, 54(7): 1603-1616. DOI: 10.7544/issn1000-1239.2017.20151135
    [3]Wang Shaopeng, Wen Yingyou, Li Zhi, and Zhao Hong. A Fast Processing Algorithm on Section Disjoint Query of Data Stream[J]. Journal of Computer Research and Development, 2014, 51(5): 1136-1148.
    [4]Zhu Huaijie, Wang Jiaying, Wang Bin, and Yang Xiaochun. Location Privacy Preserving Obstructed Nearest Neighbor Queries[J]. Journal of Computer Research and Development, 2014, 51(1): 115-125.
    [5]Yang Zexue, Hao Zhongxiao. Group Obstacle Nearest Neighbor Query in Spatial Database[J]. Journal of Computer Research and Development, 2013, 50(11): 2455-2462.
    [6]Zhu Yangyong, Dai Dongbo, and Xiong Yun. A Survey of the Research on Similarity Query Technique of Sequence Data[J]. Journal of Computer Research and Development, 2010, 47(2): 264-276.
    [7]Wang Bin, Guo Qing, Li Zhongbo, Yang Xiaochun. Index Structures for Supporting Block Edit Distance[J]. Journal of Computer Research and Development, 2010, 47(1): 191-199.
    [8]Xu Shifeng, Gao Jun, Yang Dongqing, and Wang Tengjiao. Pass-Count-Based Path Query on Big Graph Datasets[J]. Journal of Computer Research and Development, 2010, 47(1): 96-103.
    [9]Yang Yuedong, Wang Lili, and Hao Aimin. Motion String: A Motion Capture Data Representation for Behavior Segmentation[J]. Journal of Computer Research and Development, 2008, 45(3): 527-534.
    [10]He Honghui, Wang Lizhen, and Zhou Lihua. pgi-distance: An Efficient Method Supporting Parallel KNN-join Process[J]. Journal of Computer Research and Development, 2007, 44(10): 1774-1781.

Catalog

    Article views (1143) PDF downloads (550) Cited by()

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return