
Parallel String Similarity Join Approach Based on CPU-GPU Heterogeneous Architecture

Xu Kunhao, Nie Tiezheng, Shen Derong, Kou Yue, Yu Ge

Citation: Xu Kunhao, Nie Tiezheng, Shen Derong, Kou Yue, Yu Ge. Parallel String Similarity Join Approach Based on CPU-GPU Heterogeneous Architecture[J]. Journal of Computer Research and Development, 2021, 58(3): 598-608. DOI: 10.7544/issn1000-1239.2021.20190567. CSTR: 32373.14.issn1000-1239.2021.20190567

  • CLC number: TP391

Funds: This work was supported by the National Key Research and Development Program of China (2018YFB1003404) and the National Natural Science Foundation of China (U1811261, 61672142).
  • Abstract: Similarity join is an important task in data cleaning, data integration, and related fields, and it has attracted extensive attention in recent years. As data volumes grow, real-time processing requirements tighten, and CPU performance gains hit a bottleneck, traditional serial similarity join algorithms can no longer meet the demands of big data processing. GPUs, used as co-processors, have delivered good acceleration in machine learning and other fields in recent years, so GPU-based parallel algorithms have become an effective way to address such performance problems. This paper therefore proposes a parallel similarity join method based on a CPU-GPU heterogeneous architecture. First, the GPU is used to build an inverted index stored in SoA (struct of arrays) layout, which addresses the low read/write efficiency of traditional index structures in parallel mode. Second, to address the performance problems of serial algorithms, a parallel dual-length filtering algorithm based on the filter-verification framework is proposed, which uses prefix filtering and the constructed inverted index to improve filtering effectiveness. The exact similarity computation in the verification step runs on the CPU, so that the heterogeneous CPU-GPU computing resources are fully used. Finally, experiments are carried out on several datasets. Compared with serial similarity join algorithms, the results show that the proposed method filters more effectively and builds its index at lower cost than existing methods, and that it achieves better similarity join performance and a good speedup ratio.
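
    The filter-verification pipeline summarized in the abstract can be illustrated with a small CPU-only sketch. It is not the authors' GPU implementation. It assumes Jaccard similarity over whitespace tokens (the abstract does not name the similarity function), and it uses the standard Jaccard length bound as a stand-in for the paper's dual-length filter, whose details are not given here. All function names (`prefix_length`, `length_ok`, `build_soa_index`, `similarity_join`) are hypothetical. The index is kept as three flat arrays to mirror the SoA idea: the postings of `tokens[i]` are `rec_ids[offsets[i]:offsets[i + 1]]`.

```python
from collections import defaultdict
from math import ceil

EPS = 1e-9  # guards ceil() and comparisons against floating-point rounding


def prefix_length(size, t):
    # Two sets with Jaccard >= t must share a token within their first
    # size - ceil(t * size) + 1 tokens under a common global token order.
    return size - ceil(t * size - EPS) + 1


def length_ok(len_x, len_y, t):
    # Standard Jaccard length bound (assumed here): t*|x| <= |y| <= |x|/t.
    return t * len_x - EPS <= len_y <= len_x / t + EPS


def jaccard(a, b):
    sa, sb = set(a), set(b)
    union = len(sa | sb)
    return len(sa & sb) / union if union else 0.0


def build_soa_index(records, t):
    # Collect prefix-token postings, then flatten them into parallel arrays.
    postings = defaultdict(list)
    for rid, rec in enumerate(records):
        for tok in rec[:prefix_length(len(rec), t)]:
            postings[tok].append(rid)
    tokens = sorted(postings)
    offsets, rec_ids = [0], []
    for tok in tokens:
        rec_ids.extend(postings[tok])
        offsets.append(len(rec_ids))
    return tokens, offsets, rec_ids


def similarity_join(strings, t):
    # Lexicographic order serves as the global token order in this sketch.
    records = [sorted(set(s.split())) for s in strings]
    tokens, offsets, rec_ids = build_soa_index(records, t)
    slot = {tok: i for i, tok in enumerate(tokens)}
    results = []
    for x, rec in enumerate(records):
        # Filter step: probe the index with prefix tokens, prune by length.
        candidates = set()
        for tok in rec[:prefix_length(len(rec), t)]:
            i = slot[tok]
            for y in rec_ids[offsets[i]:offsets[i + 1]]:
                if y > x and length_ok(len(rec), len(records[y]), t):
                    candidates.add(y)
        # Verification step: exact similarity on the surviving candidates.
        for y in sorted(candidates):
            sim = jaccard(rec, records[y])
            if sim >= t:
                results.append((x, y, sim))
    return results


demo = [
    "gpu parallel string similarity join",
    "parallel string similarity join gpu cpu",
    "inverted index prefix filter",
    "completely different text here",
]
print([(x, y) for x, y, _ in similarity_join(demo, 0.8)])  # prints [(0, 1)]
```

    In a GPU setting, flat arrays like these are what make coalesced parallel reads and writes practical, which is the motivation the abstract gives for the SoA layout; the split between a filter pass and a separate exact verification pass is what lets the two stages run on different processors.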
Publication history
  • Published: 2021-02-28
