ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2020, Vol. 57 ›› Issue (12): 2635-2647. doi: 10.7544/issn1000-1239.2020.20190731

• Software Technology •

Efficient Implementation of Parallel Symmetric Matrix Tridiagonalization Algorithm on GPU Cluster

Liu Shifang1,2, Zhao Yonghua1, Yu Tianyu1,2, Huang Rongfeng1,2   

  1(Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190); 2(University of Chinese Academy of Sciences, Beijing 100190) (liushifang@cnic.cn)
  • Online: 2020-12-01
  • Supported by: 
    This work was supported by the National Key Research and Development Program of China (2017YFB0202202) and the Strategic Priority Research Program of the Chinese Academy of Sciences (Category C) (XDC01040000).

Abstract: Symmetric matrix tridiagonalization is the key computational step in solving dense eigenproblems. This paper presents an implementation of a hybrid parallel blocked algorithm for dense symmetric matrix tridiagonalization based on MPI (message passing interface) + CUDA (compute unified device architecture) for GPU clusters. The design uses a two-level parallel method: MPI parallelism at the cluster level and CUDA parallelism at the GPU level. At the MPI level, the communication performance of the tridiagonalization algorithm is improved by replacing the global data communication between the row and column communicators of the two-dimensional communication grid with fully parallel point-to-point data communication. Moreover, by restructuring the original MPI parallel tridiagonalization algorithm, irregular matrix-vector operations are avoided in the GPU-level parallelism, which roughly doubles the performance of this part. In addition, fine-grained computations in the GPU-level parallelism are merged into larger-grained computations; this strategy increases the computational intensity, so the computing power of the GPU is exploited more fully, GPU utilization rises, and the performance of the algorithm improves. Furthermore, multiple CUDA streams are used so that independent CUDA operations in the algorithm can execute concurrently in different streams, and asynchronous data transfers between the CPU and the GPU allow transfers and kernels in different streams to run simultaneously, which hides the data-transfer time and further improves performance. On the supercomputer system Era of the Computer Network Information Center, Chinese Academy of Sciences, where each compute node is configured with 2 Nvidia Tesla K20 GPGPU cards and 2 Intel E5-2680 V2 processors, we tested the performance of the MPI+CUDA tridiagonalization blocked algorithm on matrices of various sizes. The implementation achieves good speedup, performance, and scalability.
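To make the cluster-level communication pattern described in the abstract concrete, the following is a minimal, illustrative MPI sketch and not the paper's actual code: it builds a two-dimensional process grid, splits it into row and column communicators, and exchanges a panel segment with point-to-point MPI_Sendrecv calls that all processes can post concurrently. The grid shape, the partner choice, and the message size are assumptions made only for this example.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int size, grid_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Build a 2-D (pr x pc) process grid; MPI_Dims_create picks a balanced shape. */
    int dims[2] = {0, 0}, periods[2] = {0, 0}, coords[2];
    MPI_Dims_create(size, 2, dims);
    MPI_Comm grid, row_comm, col_comm;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);
    MPI_Comm_rank(grid, &grid_rank);
    MPI_Cart_coords(grid, grid_rank, 2, coords);

    /* Row communicator groups processes with the same row coordinate;
       column communicator groups processes with the same column coordinate. */
    MPI_Comm_split(grid, coords[0], coords[1], &row_comm);
    MPI_Comm_split(grid, coords[1], coords[0], &col_comm);

    /* Instead of a collective over the whole grid, each process exchanges its
       panel segment with a neighbor inside its row communicator; every row can
       post these point-to-point exchanges at the same time, which is the
       "completely parallel" communication pattern the abstract refers to. */
    double seg[4]  = {1.0 * grid_rank, 2.0, 3.0, 4.0};   /* illustrative panel segment */
    double recv[4];
    int right = (coords[1] + 1) % dims[1];               /* illustrative partner choice */
    int left  = (coords[1] - 1 + dims[1]) % dims[1];
    MPI_Sendrecv(seg,  4, MPI_DOUBLE, right, 0,
                 recv, 4, MPI_DOUBLE, left,  0,
                 row_comm, MPI_STATUS_IGNORE);

    if (grid_rank == 0)
        printf("grid %d x %d, received %f from left neighbor\n", dims[0], dims[1], recv[0]);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Comm_free(&grid);
    MPI_Finalize();
    return 0;
}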
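Similarly, the stream-based overlap technique mentioned in the abstract can be sketched in a few lines of CUDA. This is a generic illustration under assumed names and sizes (the kernel scale_panel stands in for the real tridiagonalization kernels, which are not shown in the abstract): independent work items are issued on separate streams, and asynchronous copies from pinned host memory let a transfer in one stream overlap a kernel running in another.

#include <cuda_runtime.h>
#include <cstdio>

/* Illustrative kernel standing in for one of the panel-update kernels. */
__global__ void scale_panel(float *panel, int n, float alpha) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) panel[i] *= alpha;
}

int main() {
    const int n = 1 << 20;
    const int nstreams = 2;
    const size_t bytes = n * sizeof(float);

    float *h_buf[nstreams], *d_buf[nstreams];
    cudaStream_t stream[nstreams];
    for (int s = 0; s < nstreams; ++s) {
        cudaMallocHost((void **)&h_buf[s], bytes);  /* pinned host memory: required for truly asynchronous copies */
        cudaMalloc((void **)&d_buf[s], bytes);
        cudaStreamCreate(&stream[s]);
        for (int i = 0; i < n; ++i) h_buf[s][i] = 1.0f;
    }

    /* Independent work items go to different streams: the copy issued in one
       stream can overlap the kernel running in the other stream, hiding the
       host-device transfer time behind computation. */
    for (int s = 0; s < nstreams; ++s) {
        cudaMemcpyAsync(d_buf[s], h_buf[s], bytes, cudaMemcpyHostToDevice, stream[s]);
        scale_panel<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], n, 2.0f);
        cudaMemcpyAsync(h_buf[s], d_buf[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();
    printf("h_buf[0][0] = %f\n", h_buf[0][0]);

    for (int s = 0; s < nstreams; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(d_buf[s]);
        cudaFreeHost(h_buf[s]);
    }
    return 0;
}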

Key words: symmetric matrix tridiagonalization, MPI+CUDA, point-to-point data communication, computational intensity, CUDA streams, scalability

CLC Number: