Citation: Xu Chuanfu, Qiu Haozhong, Che Yonggang. Optimizing Sequences of Sparse Matrix-Vector Multiplications via Cache Data Reuse[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202550125
Many high performance computing (HPC) applications, such as solving sparse linear systems, involve computing sequences of sparse matrix-vector multiplications (SpMV) of the form Ax, A^2x, …, A^sx, known as the sparse matrix power kernel (MPK). Since MPK calls SpMV repeatedly with the same sparse matrix, reusing the matrix elements in cache, instead of reloading them from main memory, can alleviate the memory bandwidth bottleneck of SpMV and enhance the performance of MPK. However, reusing the matrix introduces data dependencies between successive SpMVs. Prior work either focuses on optimizing an individual SpMV invocation or incurs significant overheads for cache data reuse in MPK. We propose a cache-aware MPK (Ca-MPK) that optimizes MPK via cache data reuse. Based on the dependency graph of a sparse matrix, an architecture-aware recursive partitioning of the graph is designed to obtain subgraphs/submatrices that fit into cache. Separating subgraphs (i.e., separators) are constructed to decouple the data dependencies among subgraphs. Cache data reuse is then achieved by executing the sequence of SpMVs on subgraphs in a specified order. Performance evaluation demonstrates that Ca-MPK outperforms the MPK implementations based on the Intel oneMKL library and the state-of-the-art approach, with average speedups of up to about 1.57 times and 1.40 times, respectively.
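To make the kernel being optimized concrete, the following is a minimal sketch (not the paper's Ca-MPK implementation) of the baseline matrix power kernel: s back-to-back SpMVs on a matrix stored in CSR format. In this naive form, each SpMV streams the entire matrix from memory, which is exactly the traffic that cache-aware subgraph scheduling aims to avoid. The function names and the small 3×3 example matrix are illustrative only.

```python
import numpy as np

def spmv_csr(indptr, indices, data, x):
    # One sparse matrix-vector product y = A @ x, with A in CSR format:
    # row i owns the nonzeros data[indptr[i]:indptr[i+1]].
    n = len(indptr) - 1
    y = np.zeros(n)
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y

def mpk_baseline(indptr, indices, data, x, s):
    # Matrix power kernel: return [A x, A^2 x, ..., A^s x].
    # Each iteration re-reads the whole matrix from memory; a cache-aware
    # scheme instead keeps cache-sized submatrices resident across powers.
    results = []
    v = x
    for _ in range(s):
        v = spmv_csr(indptr, indices, data, v)
        results.append(v)
    return results

# Illustrative 3x3 symmetric matrix A = [[2,0,1],[0,3,0],[1,0,2]] in CSR.
indptr = np.array([0, 2, 3, 5])
indices = np.array([0, 2, 1, 0, 2])
data = np.array([2.0, 1.0, 3.0, 1.0, 2.0])
x = np.ones(3)
powers = mpk_baseline(indptr, indices, data, x, 2)  # [A x, A^2 x]
```

Reusing the matrix across the s products is only legal once the inter-SpMV dependencies are handled, which is what the recursive partitioning and separators described above provide.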
[1] Zhang Yichen, Li Shengguo, Yuan Fan, et al. Memory-aware optimization for sequences of sparse matrix-vector multiplications[C] //Proc of 2023 IEEE Int Parallel and Distributed Processing Symp (IPDPS). Piscataway, NJ: IEEE, 2023: 379−389
[2] Alappat C, Hager G, Schenk O, et al. Level-based blocking for sparse matrices: Sparse matrix-power-vector multiplication[J]. IEEE Transactions on Parallel and Distributed Systems, 2022, 34(2): 581−597
[3] Gurhem J, Vandromme M, Tsuji M, et al. Sequences of sparse matrix-vector multiplication on Fugaku's A64FX processors[C] //Proc of 2021 IEEE Int Conf on Cluster Computing (CLUSTER). Piscataway, NJ: IEEE, 2021: 751−758
[4] Saad Y. Numerical Methods for Large Eigenvalue Problems: Revised Edition[M]. Philadelphia, PA: SIAM, 2011
[5] Filippone S, Cardellini V, Barbieri D, et al. Sparse matrix-vector multiplication on GPGPUs[J]. ACM Transactions on Mathematical Software, 2017, 43(4): 1−49
[6] Barrett R, Berry M, Chan T F, et al. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods[M]. Philadelphia, PA: SIAM, 1994
[7] Vuduc R W. Automatic Performance Tuning of Sparse Matrix Kernels[M]. Berkeley, CA: University of California, 2003
[8] Buluç A, Fineman J T, Frigo M, et al. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks[C] //Proc of the 21st Annual Symp on Parallelism in Algorithms and Architectures. New York: ACM, 2009: 233−244
[9] Demmel J, Hoemmen M, Mohiyuddin M, et al. Avoiding communication in computing Krylov subspaces[R]. Berkeley, CA: University of California, 2010
[10] Hoemmen M. Communication-Avoiding Krylov Subspace Methods[M]. Berkeley, CA: University of California, 2007
[11] Carson E C. Communication-Avoiding Krylov Subspace Methods in Theory and Practice[M]. Berkeley, CA: University of California, 2015
[12] Dongarra J, Tomov S, Luszczek P, et al. With extreme computing, the rules have changed[J]. Computing in Science & Engineering, 2017, 19(3): 52−62
[13] Mohiyuddin M, Hoemmen M, Demmel J, et al. Minimizing communication in sparse matrix solvers[C] //Proc of the Conf on High Performance Computing Networking, Storage and Analysis. New York: ACM, 2009: 1−12
[14] Morlan J, Kamil S, Fox A. Auto-tuning the matrix powers kernel with SEJITS[C] //Proc of the 10th Int Conf on High Performance Computing for Computational Science (VECPAR 2012). Berlin: Springer, 2013: 391−403
[15] Muranushi T, Makino J. Optimal temporal blocking for stencil computation[J]. Procedia Computer Science, 2015, 51(1): 1303−1312
[16] Huber D, Schreiber M, Schulz M. Graph-based multi-core higher-order time integration of linear autonomous partial differential equations[J]. Journal of Computational Science, 2021, 53(4): 101−139
[17] Alappat C, Basermann A, Bishop A R, et al. A recursive algebraic coloring technique for hardware-efficient symmetric sparse matrix-vector multiplication[J]. ACM Transactions on Parallel Computing, 2020, 7(3): 1−37
[18] Qiu Haozhong, Xu Chuanfu, Fang Jianbin, et al. Towards scalable unstructured mesh computations on shared memory many-cores[C] //Proc of the 29th ACM SIGPLAN Annual Symp on Principles and Practice of Parallel Programming. New York: ACM, 2024: 109−119
[19] Naumov M. S-step and communication-avoiding iterative methods[R]. Santa Clara, CA: NVIDIA, 2016
[20] Karypis G, Kumar V. A fast and high quality multilevel scheme for partitioning irregular graphs[J]. SIAM Journal on Scientific Computing, 1998, 20(1): 359−392. doi: 10.1137/S1064827595287997
[21] Intel Corporation. Intel oneAPI math kernel library (oneMKL)[EB/OL]. Santa Clara, CA: Intel Corporation, 2024[2024-03-25]. https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html
[22] Davis T A, Hu Y. The University of Florida sparse matrix collection[J]. ACM Transactions on Mathematical Software, 2011, 38(1): 1−25
[23] Treibig J, Hager G, Wellein G. LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments[C] //Proc of the 39th Int Conf on Parallel Processing Workshops. Piscataway, NJ: IEEE, 2010: 207−216
|