ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2021, Vol. 58 ›› Issue (3): 598-608.doi: 10.7544/issn1000-1239.2021.20190567

Previous Articles     Next Articles

Parallel String Similarity Join Approach Based on CPU-GPU Heterogeneous Architecture

Xu Kunhao, Nie Tiezheng, Shen Derong, Kou Yue, Yu Ge   

  1. (School of Computer Science and Engineering, Northeastern University, Shenyang 110169)
  • Online:2021-03-01
  • Supported by: 
    This work was supported by the National Key Research and Development Program of China (2018YFB1003404) and the National Natural Science Foundation of China (U1811261, 61672142).

Abstract: Similarity join is an important task in data cleaning, data integration and other fields, which has attracted extensive attention in recent years. With the increasing amount of data, the improvement of real-time processing requirement and the bottleneck of CPU performance improvement, the traditional serial algorithms of similarity join have been unable to meet the requirement of current big data processing. As a co-processor, GPU has achieved good acceleration results in machine learning and other fields in recent years. It is of great practical significance to study the parallel similarity join algorithms based on GPU. This paper proposes a parallel similarity join algorithm based on CPU-GPU heterogeneous architecture. Firstly, GPU is used to construct inverted index based on SoA (struct of arrays), which solves the problem of low efficiency of traditional index structure in parallel reading and writing. Then, to address the performance problem of serial algorithms, a parallel dual-length filtering algorithm based on filter-verification framework is proposed. Inverted index and prefix filtering algorithm are used to further improve the filtering performance. And in our approach, the calculation for exact similarity verification is performed by CPU to make full use of heterogeneous computing resources of CPU-GPU. Finally, experiments are carried out on several datasets. Compared with the serial similarity join algorithms, the results show that our proposed algorithms have better filtering performance and lower index generation time than existing algorithms, and also have better processing performance and higher speedup ratio on the similarity join.

Key words: similarity join, filter-verification framework, inverted index, GPU parallel processing, heterogeneous architecture

CLC Number: