Efficient Implementation of Parallel Symmetric Matrix Tridiagonalization Algorithm on GPU Cluster

Liu Shifang; Zhao Yonghua; Yu Tianyu; Huang Rongfeng

doi:10.7544/issn1000-1239.2020.20190731

Abstract: Symmetric matrix tridiagonalization is the key computational step in solving dense eigenvalue problems. This paper presents a hybrid parallel blocked algorithm for tridiagonalizing dense symmetric matrices on GPU clusters, based on MPI (Message Passing Interface) and CUDA (Compute Unified Device Architecture). The design uses two levels of parallelism: MPI at the cluster level and CUDA at the GPU level. At the MPI level, the global data communication between the row and column communication domains of the two-dimensional communication domain is redesigned as fully parallel point-to-point communication, which improves the communication performance of the MPI parallel tridiagonalization algorithm. The original MPI parallel algorithm is also modified so that the GPU level no longer needs irregular matrix-vector operations; the parallel performance of this part roughly doubles. In addition, fine-grained computations on the GPU are merged into coarser-grained ones. The higher computational intensity makes fuller use of the GPU's computing power, raises GPU utilization, and improves the performance of the algorithm. Multiple CUDA streams allow independent CUDA operations in the algorithm to execute concurrently in different streams. Asynchronous data transfer between the CPU and the GPU allows transfers and kernel functions in different streams to run at the same time, which hides the data transfer time and further improves performance.
The implementation was tested on the supercomputer system Era ("元") of the Computer Network Information Center of the Chinese Academy of Sciences, where each compute node is configured with 2 Nvidia Tesla K20 GPGPU cards and 2 Intel E5-2680 V2 processors. For matrices of different sizes, the MPI+CUDA blocked tridiagonalization algorithm achieved good speedup and performance and showed good scalability.
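
For orientation, the reduction that the paper parallelizes can be written as a short serial routine. The sketch below is not the authors' MPI+CUDA blocked algorithm. It is a minimal pure-Python reference that assumes the standard Householder approach (the abstract does not name the reflector type) and a small dense symmetric matrix stored as a list of lists. The function name `tridiagonalize` and the sample matrix are hypothetical.

```python
import math


def tridiagonalize(a):
    """Reduce a symmetric matrix (list of lists) to tridiagonal form.

    Serial, unblocked Householder reduction; returns a new matrix that is
    orthogonally similar to `a`. Illustrative only, O(n^3) per column here.
    """
    n = len(a)
    t = [row[:] for row in a]  # work on a copy
    for k in range(n - 2):
        # x holds the entries of column k below the diagonal.
        x = [t[i][k] for i in range(k + 1, n)]
        norm_x = math.sqrt(sum(xi * xi for xi in x))
        if norm_x == 0.0:
            continue  # column is already in the desired form
        # Choose the sign of alpha to avoid cancellation in v[0].
        alpha = -math.copysign(norm_x, x[0])
        v = x[:]
        v[0] -= alpha
        norm_v = math.sqrt(sum(vi * vi for vi in v))
        # Unit Householder vector u, padded with zeros in rows 0..k.
        u = [0.0] * (k + 1) + [vi / norm_v for vi in v]
        # Symmetric rank-2 update: T <- (I - 2uu^T) T (I - 2uu^T)
        #                            = T - 2 u w^T - 2 w u^T,
        # with p = T u and w = p - (u^T p) u.
        p = [sum(t[i][j] * u[j] for j in range(n)) for i in range(n)]
        kappa = sum(u[i] * p[i] for i in range(n))
        w = [p[i] - kappa * u[i] for i in range(n)]
        for i in range(n):
            for j in range(n):
                t[i][j] -= 2.0 * (u[i] * w[j] + w[i] * u[j])
    return t


# Hypothetical 4x4 symmetric test matrix.
a = [
    [4.0, 1.0, -2.0, 2.0],
    [1.0, 2.0, 0.0, 1.0],
    [-2.0, 0.0, 3.0, -2.0],
    [2.0, 1.0, -2.0, -1.0],
]
t = tridiagonalize(a)
```

After the call, the entries of `t` outside the three central diagonals are zero up to rounding error, and `t` has the same trace as `a` because the transformation is an orthogonal similarity. The matrix-vector product `p = T u` and the rank-2 update are the operations that a blocked parallel version would distribute and batch.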
