Parallel String Similarity Join Approach Based on CPU-GPU Heterogeneous Architecture

Xu Kunhao; Nie Tiezheng; Shen Derong; Kou Yue; Yu Ge

doi:10.7544/issn1000-1239.2021.20190567

Xu Kunhao, Nie Tiezheng, Shen Derong, Kou Yue, Yu Ge. Parallel String Similarity Join Approach Based on CPU-GPU Heterogeneous Architecture[J]. Journal of Computer Research and Development, 2021, 58(3): 598-608. DOI: 10.7544/issn1000-1239.2021.20190567

Citation:

Parallel String Similarity Join Approach Based on CPU-GPU Heterogeneous Architecture

Graphical Abstract

Abstract

Abstract

Similarity join is an important task in data cleaning, data integration and other fields, which has attracted extensive attention in recent years. With the increasing amount of data, the improvement of real-time processing requirement and the bottleneck of CPU performance improvement, the traditional serial algorithms of similarity join have been unable to meet the requirement of current big data processing. As a co-processor, GPU has achieved good acceleration results in machine learning and other fields in recent years. It is of great practical significance to study the parallel similarity join algorithms based on GPU. This paper proposes a parallel similarity join algorithm based on CPU-GPU heterogeneous architecture. Firstly, GPU is used to construct inverted index based on SoA (struct of arrays), which solves the problem of low efficiency of traditional index structure in parallel reading and writing. Then, to address the performance problem of serial algorithms, a parallel dual-length filtering algorithm based on filter-verification framework is proposed. Inverted index and prefix filtering algorithm are used to further improve the filtering performance. And in our approach, the calculation for exact similarity verification is performed by CPU to make full use of heterogeneous computing resources of CPU-GPU. Finally, experiments are carried out on several datasets. Compared with the serial similarity join algorithms, the results show that our proposed algorithms have better filtering performance and lower index generation time than existing algorithms, and also have better processing performance and higher speedup ratio on the similarity join.

FullText(HTML)

References (0)

Supplements (0)

Cited By

Turn off MathJax

Article Contents

Parallel String Similarity Join Approach Based on CPU-GPU Heterogeneous Architecture

Abstract

Catalog

Export File

Citation

Format

Content