Citation: Guo Rongyuan, Jia Haipeng, Zhang Yunquan, Wei Cunyang, Deng Mingsen, Chen Jingrui, Zhou Zhenya. Small-Scale Irregular TRSM Implementation and Optimization[J]. Journal of Computer Research and Development, 2025, 62(2): 517-531. DOI: 10.7544/issn1000-1239.202330864
TRSM (triangular matrix equation solver) is a widely used routine for solving triangular systems of linear equations and a core algorithm in many scientific computing libraries and mathematical software packages, with applications in scientific computing, engineering computing, and machine learning. Small-scale irregular TRSM narrows this problem domain: it targets the efficient handling of inputs that are small in scale and irregular in shape. As the field of high-performance computing develops toward personalization and refinement, the demand from the scientific and industrial communities for small-scale irregular TRSM computation is becoming increasingly evident. Traditional algorithms are better suited to large-scale, regular TRSM computation, so there is still considerable room to improve the computational efficiency of small-scale, irregular TRSM. In this paper, we propose an optimization scheme for small-scale irregular TRSM that combines the characteristics of the hardware architecture and the application scenario. We design high-performance kernels from the perspectives of register blocking, boundary processing, and vectorized computation, and on this basis we build SI_TRSM (small-scale irregular TRSM), an algorithm library covering double-precision real and double-precision complex numbers, which greatly improves the performance of this operation. Experimental results show that, compared with the corresponding routines in MKL (Intel math kernel library), the double-precision SI_TRSM library developed in this paper improves average performance by a factor of 29.4 for small-scale irregular real inputs and by a factor of 24.6 for small-scale irregular complex inputs.
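To make the operation concrete, the sketch below is a hypothetical reference implementation in C of the computation TRSM performs, not the paper's optimized kernel. It assumes the standard BLAS conventions (column-major storage; the left-side, lower-triangular, non-transposed, non-unit-diagonal case) and solves A·X = αB in place, overwriting B with X. The AVX2/FMA inner loop with a scalar tail only loosely mirrors the vectorized-computation and boundary-processing ideas described in the abstract; the function name si_trsm_sketch and its signature are invented for illustration (compile with -mavx2 -mfma).

```c
#include <stddef.h>
#include <immintrin.h>  /* AVX2 and FMA intrinsics */

/* Hypothetical sketch of the TRSM computation: solves A * X = alpha * B,
 * where A is m x m lower triangular with non-unit diagonal and B is m x n,
 * both column-major as in standard BLAS; B is overwritten with X.
 * This is an illustrative reference, not the paper's SI_TRSM kernel. */
static void si_trsm_sketch(size_t m, size_t n, double alpha,
                           const double *A, size_t lda,
                           double *B, size_t ldb)
{
    for (size_t j = 0; j < n; ++j) {
        double *b = B + j * ldb;            /* current right-hand side */
        for (size_t i = 0; i < m; ++i)
            b[i] *= alpha;                  /* scale the column by alpha */
        for (size_t k = 0; k < m; ++k) {    /* forward substitution */
            double x = b[k] / A[k + k * lda];
            b[k] = x;
            const double *a = A + k * lda;  /* column k of A */
            __m256d xv = _mm256_set1_pd(x);
            size_t i = k + 1;
            for (; i + 4 <= m; i += 4) {    /* 4-wide vectorized update */
                __m256d bi = _mm256_loadu_pd(b + i);
                __m256d ai = _mm256_loadu_pd(a + i);
                bi = _mm256_fnmadd_pd(ai, xv, bi);  /* b -= a * x */
                _mm256_storeu_pd(b + i, bi);
            }
            for (; i < m; ++i)              /* scalar tail: boundary rows */
                b[i] -= a[i] * x;
        }
    }
}
```

For small, irregular shapes, the cost of such a kernel is dominated by the short, odd-sized loops; the abstract's register-blocking and boundary-processing designs address exactly this regime, where the packing and cache-blocking machinery of large-scale TRSM implementations adds overhead rather than performance.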