Advanced Search
    Xu Kunhao, Nie Tiezheng, Shen Derong, Kou Yue, Yu Ge. Parallel String Similarity Join Approach Based on CPU-GPU Heterogeneous Architecture[J]. Journal of Computer Research and Development, 2021, 58(3): 598-608. DOI: 10.7544/issn1000-1239.2021.20190567
    Citation: Xu Kunhao, Nie Tiezheng, Shen Derong, Kou Yue, Yu Ge. Parallel String Similarity Join Approach Based on CPU-GPU Heterogeneous Architecture[J]. Journal of Computer Research and Development, 2021, 58(3): 598-608. DOI: 10.7544/issn1000-1239.2021.20190567

    Parallel String Similarity Join Approach Based on CPU-GPU Heterogeneous Architecture

    • Similarity join is an important task in data cleaning, data integration and other fields, which has attracted extensive attention in recent years. With the increasing amount of data, the improvement of real-time processing requirement and the bottleneck of CPU performance improvement, the traditional serial algorithms of similarity join have been unable to meet the requirement of current big data processing. As a co-processor, GPU has achieved good acceleration results in machine learning and other fields in recent years. It is of great practical significance to study the parallel similarity join algorithms based on GPU. This paper proposes a parallel similarity join algorithm based on CPU-GPU heterogeneous architecture. Firstly, GPU is used to construct inverted index based on SoA (struct of arrays), which solves the problem of low efficiency of traditional index structure in parallel reading and writing. Then, to address the performance problem of serial algorithms, a parallel dual-length filtering algorithm based on filter-verification framework is proposed. Inverted index and prefix filtering algorithm are used to further improve the filtering performance. And in our approach, the calculation for exact similarity verification is performed by CPU to make full use of heterogeneous computing resources of CPU-GPU. Finally, experiments are carried out on several datasets. Compared with the serial similarity join algorithms, the results show that our proposed algorithms have better filtering performance and lower index generation time than existing algorithms, and also have better processing performance and higher speedup ratio on the similarity join.
    • loading

    Catalog

      Turn off MathJax
      Article Contents

      /

      DownLoad:  Full-Size Img  PowerPoint
      Return
      Return