Citation: Xia Tian, Fu Gelin, Qu Shaoru, Luo Zhongpei, Ren Pengju. Optimization of Parallel Computation on Sparse Matrix-Vector Multiplication with High Predictability[J]. Journal of Computer Research and Development, 2023, 60(9): 1973-1987. DOI: 10.7544/issn1000-1239.202330421
Sparse matrix-vector multiplication (SpMV) is a critical kernel widely used in scientific computing, industrial simulation, and intelligent computing. These applications typically require iterative SpMV computation to meet the demands of precise numerical simulation, linear algebra solving, and graph analytics. However, because highly sparse and random nonzero distributions cause poor data locality, low cache utilization, and extremely irregular computation patterns, SpMV optimization has become one of the most challenging problems on modern high-performance processors. In this paper, we study the bottlenecks of SpMV on current out-of-order CPUs and propose to improve its performance by pursuing high predictability and low program complexity. Specifically, we improve memory access regularity and locality by creating serialized access patterns, which optimizes data prefetching efficiency and cache utilization. We also improve pipeline efficiency by creating regular branch patterns that make branch prediction more accurate. Meanwhile, we flexibly leverage SIMD instructions to parallelize execution and fully exploit the CPU's computational resources. Experimental results show that with these optimizations, our SpMV kernel significantly alleviates the critical bottlenecks and improves the efficiency of the CPU pipeline, cache, and memory bandwidth. The resulting performance achieves an average 2.6x speedup over Intel's commercial MKL library, as well as an average 1.3x speedup over the best existing SpMV algorithm.
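To make the baseline concrete, the following minimal C sketch shows a standard CSR-format SpMV kernel of the kind these optimizations target. It is an illustrative reconstruction under common CSR conventions, not the authors' optimized implementation; the comments mark the irregularities the paper addresses.

/* Minimal baseline CSR SpMV: y = A * x.
 * A sketch of the conventional kernel, not the paper's optimized one. */
#include <stddef.h>

void spmv_csr(size_t n_rows,
              const size_t *row_ptr,  /* n_rows + 1 entries; row_ptr[i]..row_ptr[i+1] spans row i */
              const size_t *col_idx,  /* column index of each nonzero */
              const double *vals,     /* nonzero values */
              const double *x,        /* dense input vector */
              double *y)              /* dense output vector */
{
    for (size_t i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        /* The inner loop's trip count varies per row (hard-to-predict
         * branches) and x[col_idx[j]] is an indirect, effectively random
         * read (poor locality, defeats hardware prefetching) -- precisely
         * the unpredictability the paper removes by serializing accesses
         * and regularizing branch patterns. */
        for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            sum += vals[j] * x[col_idx[j]];
        y[i] = sum;
    }
}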