%A Liu Shifang, Zhao Yonghua, Yu Tianyu, Huang Rongfeng
%T Efficient Implementation of Parallel Symmetric Matrix Tridiagonalization Algorithm on GPU Cluster
%0 Journal Article
%D 2020
%J Journal of Computer Research and Development
%R 10.7544/issn1000-1239.2020.20190731
%P 2635-2647
%V 57
%N 12
%U {https://crad.ict.ac.cn/CN/abstract/article_4316.shtml}
%8 2020-12-01
%X The symmetric matrix tridiagonalization is the key computational process for solving dense eigenproblems. This paper presents the implementation of the dense symmetric matrix tridiagonal hybrid parallel blocked algorithm based on MPI(message passing interface)+CUDA(compute unified device architecture) for GPU cluster. The parallel algorithm design uses a two-level parallel method of MPI cluster level and GPU level. In the MPI-level parallelism, the communication performance of the tridiagonal MPI parallel algorithm is improved by designing the global data communication between the row-column communication domains in the two-dimensional communication domain as a completely parallel point-to-point data communication method. Moreover, by improving the original matrix tridiagonalized MPI parallel algorithm, the use of irregular matrix-vector operation in GPU-level parallelism is avoided, and the parallel performance of this part is improved by about 1 time. Whatâ€™s more, the small-grained computing existing in the GPU parallel is merged into a larger granularity calculation, which fully utilize the computing power of the GPU by increasing the computational intensity, thereby increasing the utilization of the GPU and improving the performance of the algorithm. In addition, multiple CUDA streams can be used to enable independent CUDA operations in the algorithm to be concurrently executed in different streams. Furthermore, in the parallel algorithm, the asynchronous data transmission between the CPU and the GPU is utilized, so that the data transmission and the kernel function in different streams are simultaneously executed, which hides the time of data transmission and improves the performance of the algorithm. On the supercomputer system Era of the Computer Network Center of the Chinese Academy of Sciences, each compute node is configured with 2 Nvidia Tesla K20 GPGPU cards and 2 Intel E5-2680 V2 processors, we tested the performance of the implementation of the tridiagonalization blocked algorithm with GPGPU cards. It has achieved better acceleration, performance and scalability.