Small-Scale Irregular TRSM Implementation and Optimization

郭容园; 贾海鹏; 张云泉; 韦存阳; 邓明森; 陈婧蕊; 周振亚

doi:10.7544/issn1000-1239.202330864

Abstract: TRSM (triangular matrix equation solver) is a widely used routine for solving systems of linear equations and a core algorithm of many scientific computing libraries and mathematical software packages, with applications in scientific computing, engineering computing, machine learning, and other fields. Small-scale irregular TRSM restricts the problem scope: it targets efficient handling of small, irregularly shaped inputs. As high-performance computing becomes more specialized and fine-grained, demand for small-scale irregular TRSM computation in both science and industry is growing. Traditional algorithms focus on large-scale, regular TRSM and are inefficient on small-scale irregular inputs. Drawing on the characteristics of the hardware architecture and the application scenarios, we propose an optimization scheme for small-scale irregular TRSM, design high-performance kernels from the perspectives of register blocking, boundary handling, and vectorized computation, and on this basis build SI_TRSM (small-scale irregular TRSM), a library covering double-precision real and double-precision complex data, which greatly improves the performance of this algorithm. Experimental results show that, compared with the corresponding routines in MKL (Intel math kernel library), the library improves average performance by 29.4 times for double-precision small-scale irregular real inputs and by 24.6 times for double-precision small-scale irregular complex inputs.
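For readers unfamiliar with the operation, the following is a minimal reference sketch of what a TRSM call computes, assuming the simplest variant only: a lower-triangular matrix A on the left, no transposition, a non-unit diagonal, and the equation A X = B. The function name `trsm_lower_left` and the list-of-lists layout are illustrative assumptions, not part of SI_TRSM or MKL, and the sketch contains none of the register blocking, boundary handling, or vectorization described above.

```python
def trsm_lower_left(a, b):
    """Solve A X = B by forward substitution.

    a: n x n lower-triangular matrix (list of lists), non-zero diagonal.
    b: n x m right-hand-side matrix (list of lists).
    Returns a new n x m matrix X; the inputs are not modified.
    Hypothetical helper for illustration only.
    """
    n = len(a)
    m = len(b[0]) if b else 0
    x = [row[:] for row in b]  # work on a copy of B
    for i in range(n):          # rows of A, top to bottom
        for j in range(m):      # each right-hand-side column
            acc = x[i][j]
            for k in range(i):  # subtract contributions of already-solved rows
                acc -= a[i][k] * x[k][j]
            x[i][j] = acc / a[i][i]
    return x


# A small, non-square ("irregular") example: A is 3 x 3, B is 3 x 2.
a = [[2.0, 0.0, 0.0],
     [1.0, 4.0, 0.0],
     [3.0, 5.0, 8.0]]
b = [[2.0, 4.0],
     [9.0, 6.0],
     [37.0, 19.0]]
x = trsm_lower_left(a, b)
print(x)  # [[1.0, 2.0], [2.0, 1.0], [3.0, 1.0]]
```

Because Python floats are double precision and the same arithmetic works on `complex` values, the sketch applies to both data types named in the abstract, but only as a correctness reference, not as a performance model.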

Small-Scale Irregular TRSM Implementation and Optimization