
Efficient Implementation of Parallel Symmetric Matrix Tridiagonalization Algorithm on GPU Cluster

Liu Shifang, Zhao Yonghua, Yu Tianyu, Huang Rongfeng

Citation: Liu Shifang, Zhao Yonghua, Yu Tianyu, Huang Rongfeng. Efficient Implementation of Parallel Symmetric Matrix Tridiagonalization Algorithm on GPU Cluster[J]. Journal of Computer Research and Development, 2020, 57(12): 2635-2647. DOI: 10.7544/issn1000-1239.2020.20190731. CSTR: 32373.14.issn1000-1239.2020.20190731

  • Chinese Library Classification (CLC) number: TP301


Funds: This work was supported by the National Key Research and Development Program of China (2017YFB0202202) and the Strategic Priority Research Program of the Chinese Academy of Sciences (Category C) (XDC01040000).
  • Abstract: Symmetric matrix tridiagonalization is the key computational step in solving dense eigenproblems. This paper presents a hybrid parallel blocked algorithm for dense symmetric matrix tridiagonalization on GPU clusters, based on MPI (Message Passing Interface) and CUDA (Compute Unified Device Architecture). The design uses two levels of parallelism: MPI at the cluster level and CUDA at the GPU level. At the MPI level, the global data communication between the row and column communication domains of the two-dimensional communication domain is redesigned as fully parallel point-to-point communication, which improves the communication performance of the MPI parallel tridiagonalization algorithm. Modifying the original MPI parallel tridiagonalization algorithm also avoids the irregular matrix-vector operations that would otherwise be needed at the GPU level, and roughly doubles the parallel performance of that part. In addition, small-grained computations on the GPU are merged into larger-grained ones; the higher computational intensity makes fuller use of the GPU's computing power, raises GPU utilization, and improves the performance of the algorithm. Multiple CUDA streams allow independent CUDA operations in the algorithm to execute concurrently in different streams. Asynchronous data transfer between the CPU and the GPU further lets data transfers and kernel executions in different streams overlap, which hides the data transfer time and improves performance again.
    The MPI+CUDA blocked tridiagonalization algorithm was tested with matrices of different sizes on the supercomputer system Era ("元") of the Computer Network Center of the Chinese Academy of Sciences, where each compute node is configured with 2 Nvidia Tesla K20 GPGPU cards and 2 Intel E5-2680 V2 processors. The implementation achieved good speedup and performance, and it scales well.
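
For readers unfamiliar with the reduction itself, the following is a minimal serial sketch of Householder tridiagonalization in plain Python. It is not the blocked MPI+CUDA algorithm described in the abstract: it is the textbook unblocked method, shown only to illustrate what "tridiagonalizing a symmetric matrix" computes. The function name `householder_tridiagonalize` and the list-of-lists matrix layout are assumptions made for this illustration.

```python
import math


def householder_tridiagonalize(a):
    """Return T = Q^T A Q, tridiagonal, for a symmetric matrix A.

    Illustrative, unblocked, serial sketch (assumed helper, not the
    paper's algorithm). `a` is a square list of lists of floats.
    """
    n = len(a)
    t = [row[:] for row in a]  # work on a copy, leave the input intact
    for k in range(n - 2):
        # x is column k below the diagonal.
        x = [t[i][k] for i in range(k + 1, n)]
        norm_x = math.sqrt(sum(c * c for c in x))
        if norm_x == 0.0:
            continue  # nothing to eliminate in this column
        # Choose alpha with the sign opposite to x[0] to avoid cancellation.
        alpha = -math.copysign(norm_x, x[0])
        v = x[:]
        v[0] -= alpha  # v = x - alpha * e1, so that H x = alpha * e1
        vnorm2 = sum(c * c for c in v)
        m = n - k - 1
        # Left multiplication by H = I - 2 v v^T / (v^T v) on rows k+1..n-1.
        for j in range(n):
            s = sum(v[i] * t[k + 1 + i][j] for i in range(m))
            f = 2.0 * s / vnorm2
            for i in range(m):
                t[k + 1 + i][j] -= f * v[i]
        # Right multiplication by H on columns k+1..n-1 (keeps symmetry).
        for i in range(n):
            s = sum(t[i][k + 1 + j] * v[j] for j in range(m))
            f = 2.0 * s / vnorm2
            for j in range(m):
                t[i][k + 1 + j] -= f * v[j]
    return t


if __name__ == "__main__":
    example = [
        [4.0, 1.0, -2.0, 2.0],
        [1.0, 2.0, 0.0, 1.0],
        [-2.0, 0.0, 3.0, -2.0],
        [2.0, 1.0, -2.0, -1.0],
    ]
    for row in householder_tridiagonalize(example):
        print(["%.4f" % value for value in row])
```

Because each step is an orthogonal similarity transformation, the result has the same eigenvalues as the input, which is why this reduction serves as the first stage of dense symmetric eigensolvers. Each step of this unblocked form is dominated by matrix-vector style work; blocked variants such as the one the abstract describes instead group several steps into larger matrix-matrix updates.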
Publication history
  • Published: 2020-11-30
