Citation: | Yan Zhiyuan, Xie Biwei, Bao Yungang. HVMS: A Hybrid Vectorization-Optimized Mechanism of SpMV[J]. Journal of Computer Research and Development, 2024, 61(12): 2969-2984. DOI: 10.7544/issn1000-1239.202330204 |
Sparse matrix-vector multiplication (SpMV) plays an important role in a wide variety of scientific and engineering applications. Executing SpMV efficiently on vector processors is challenging because of the irregular memory accesses of sparse matrices. We conduct a detailed analysis of the main factors that affect the efficiency of SpMV vectorization. Beyond the irregular distribution of non-zero elements within a single sparse matrix, different sparse matrices also exhibit huge variations in their non-zero distribution characteristics. It is therefore difficult to apply a universal vector optimization method to matrices with such diverse characteristics. Furthermore, vector computing capabilities and vector instruction sets differ significantly across vector processors. The primary challenge of SpMV vectorization lies in mapping irregular sparse matrices onto regular vector processors. In this paper, we propose a hybrid vectorization-optimized mechanism of SpMV (HVMS). HVMS models the characteristics of vector processors and, based on a set of abstracted basic operations, designs corresponding rules to guide the regularization conversion of sparse matrices. HVMS divides a matrix into different parts according to their characteristics, so that the non-zero distribution within each part is less irregular, and then introduces corresponding optimization mechanisms to boost the vectorization efficiency of SpMV. We implement and evaluate HVMS on an Intel Xeon processor and compare it with three state-of-the-art approaches on 30 sparse matrices. Experimental results show that HVMS achieves average speedups of 1.60x, 1.72x, and 1.93x over CVR, SELL-C-σ, and Intel MKL, respectively.
[1] |
Zilli G. Iterative methods for solving sparse linear systems with a parallel preconditioner[J]. International Journal of Computer Mathematics, 1992, 44(1/2/3/4): 111−119
|
[2] |
Diaz J, Munoz-Caro C, Nino A. A survey of parallel programming models and tools in the multi and many-core era[J]. IEEE Transactions on Parallel and Distributed Systems, 2012, 23(8): 1369−1386 doi: 10.1109/TPDS.2011.308
|
[3] |
Chrysos G. Intel® Xeon Phi™ coprocessor-the architecture[EB/OL]. 2014[2023-07-31]. http://gec.di.uminho.pt/minf/cpd1314/SCD/Intel_Xeon-PhiArch.pdf
|
[4] |
Anderson C S, Zhang Jingwei, Cornea M. Enhanced vector math support on the Intel® AVX-512 architecture[C]//Proc of the 25th Symp on Computer Arithmetic (ARITH). Piscataway, NJ: IEEE, 2018: 120−124
|
[5] |
Reddy V G. NEON technology introduction[J]. ARM Corporation, 2008, 4(1): 1−33
|
[6] |
Wang Endong, Zhang Qing, Shen Bo, et al. Intel math kernel library[EB/OL]. 2014[2023-06-13]. https://link.springer.com/chapter/10.1007/978-3-319-06486-4_7
|
[7] |
Liu Weifeng, Vinter B. CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication[C]//Proc of the 29th Int Conf on Supercomputing. New York: ACM, 2015: 339−350
|
[8] |
Xie Biwei, Zhan Jianfeng, Liu Xu, et al. CVR: Efficient vectorization of SpMV on x86 processors[C]//Proc of the 16th Annual IEEE/ACM Int Symp on Code Generation and Optimization. New York: ACM, 2018: 149−162
|
[9] |
Kreutzer M, Hager G, Wellein G, et al. A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units[J]. SIAM Journal on Scientific Computing, 2014, 36(5): C401−C423 doi: 10.1137/130930352
|
[10] |
Flynn M J. Some computer organizations and their effectiveness[J]. IEEE Transactions on Computers, 1972, 21(9): 948−960
|
[11] |
Lomont C. Introduction to Intel advanced vector extensions[EB/OL]. 2011[2023-06-13]. https://hpc.llnl.gov/sites/default/files/intelAVXintro.pdf
|
[12] |
Oberman S, Favor G, Weber F. AMD 3DNow! Technology: Architecture and implementations[J]. IEEE Micro, 1999, 19(2): 37−48 doi: 10.1109/40.755466
|
[13] |
Luebke D, Harris M, Govindaraju N, et al. GPGPU: General-purpose computation on graphics hardware[C/OL]//Proc of the 2006 ACM/IEEE Conf on Supercomputing. New York: ACM, 2006[2023-08-04].https://dl.acm.org/doi/10.1145/1188455.1188672
|
[14] |
Davis T A, Hu Yifan. The University of Florida sparse matrix collection[J]. ACM Transactions on Mathematical Software, 2011, 38(1): 1−25
|
[15] |
Li Chenyang, Xia Tian, Zhao Wenzhe, et al. SpV8: Pursuing optimal vectorization and regular computation pattern in SpMV[C]//Proc of the 58th ACM/IEEE Design Automation Conf (DAC). Piscataway, NJ: IEEE, 2021: 661−666
|
[16] |
Bian Haodong, Huang Jianqiang, Liu Lingbin, et al. ALBUS: A method for efficiently processing SpMV using SIMD and load balancing[J]. Future Generation Computer Systems, 2021, 116: 371−392 doi: 10.1016/j.future.2020.10.036
|
[17] |
Bian Haodong, Huang Jianqiang, Dong Runting, et al. CSR2: A new format for SIMD-accelerated SpMV[C]//Proc of the 20th IEEE/ACM Int Symp on Cluster, Cloud and Internet Computing (CCGRID). Piscataway, NJ: IEEE, 2020: 350−359
|
[18] |
Tang Waiteng, Zhao Ruizhe, Lu Mian, et al. Optimizing and auto-tuning scale-free sparse matrix-vector multiplication on Intel Xeon Phi[C]//Proc of the 13th Annual IEEE/ACM Int Symp on Code Generation and Optimization. Piscataway, NJ: IEEE, 2015: 136−145
|
[19] |
Eberhardt R, Hoemmen M. Optimization of block sparse matrix-vector multiplication on shared-memory parallel architectures[C]//Proc of the 2016 IEEE Int Parallel and Distributed Processing Symp Workshops (IPDPSW). Piscataway, NJ: IEEE, 2016: 663−672
|
[20] |
Yan Shengen, Li Chao, Zhang Yunquan, et al. yaSpMV: Yet another SpMV framework on GPUs[J]. ACM SIGPLAN Notices, 2014, 49(8): 107−118 doi: 10.1145/2692916.2555255
|
[21] |
Merrill D, Garland M. Merge-based parallel sparse matrix-vector multiplication[C]//Proc of the 2016 Int Conf for High Performance Computing, Networking, Storage and Analysis. Piscataway, NJ: IEEE, 2016: 678−689
|
[22] |
Yesil S, Heidarshenas A, Morrison A, et al. Speeding up SpMV for power-law graph analytics by enhancing locality & vectorization[C/OL]//Proc of the 2020 Int Conf for High Performance Computing, Networking, Storage and Analysis. Piscataway, NJ: IEEE, 2020[2023-07-31]. https://ieeexplore.ieee.org/document/9355205
|
[23] |
Elafrou A, Goumas G, Koziris N. Conflict-free symmetric sparse matrix-vector multiplication on multicore architectures[C/OL]//Proc of the 2019 Int Conf for High Performance Computing, Networking, Storage and Analysis. New York: ACM, 2019[2023-07-31]. https://dl.acm.org/doi/10.1145/3295500.3356148
|
[24] |
Fei Xiang, Zhang Youhui. Regu2D: Accelerating vectorization of SpMV on Intel processors through 2D-partitioning and regular arrangement[C/OL]//Proc of the 50th Int Conf on Parallel Processing. New York: ACM, 2021[2023-07-31]. https://dl.acm.org/doi/10.1145/3472456.3472479
|
[25] |
Liu Changxi, Xie Biwei, Liu Xin, et al. Towards efficient SpMV on sunway manycore architectures[C]//Proc of the 2018 Int Conf on Supercomputing. New York: ACM, 2018: 363−373
|
[26] |
Buono D, Petrini F, Checconi F, et al. Optimizing sparse matrix-vector multiplication for large-scale data analytics[C/OL]//Proc of the 2016 Int Conf on Supercomputing. New York: ACM, 2016[2023-07-31]. https://dl.acm.org/doi/10.1145/2925426.2926278
|
[27] |
Umuroglu Y, Jahre M. An energy efficient column-major backend for FPGA SpMV accelerators[C]//Proc of the 32nd Int Conf on Computer Design (ICCD). Piscataway, NJ: IEEE, 2014: 432−439
|
[28] |
Liu Xing, Smelyanskiy M, Chow E, et al. Efficient sparse matrix-vector multiplication on x86-based many-core processors[C]//Proc of the 27th Int ACM Conf on Int Conf on Supercomputing. New York: ACM, 2013: 273−282
|
[29] |
Li Kenli, Yang Wangdong, Li Keqin. Performance analysis and optimization for SpMV on GPU using probabilistic modeling[J]. IEEE Transactions on Parallel and Distributed Systems, 2014, 26(1): 196−205
|
[30] |
Niu Yuyao, Lu Zhengyang, Dong Meichen, et al. TileSpMV: A tiled algorithm for sparse matrix-vector multiplication on GPUs[C]//Proc of the 2021 IEEE Int Parallel and Distributed Processing Symp (IPDPS). Piscataway, NJ: IEEE, 2021: 68−78
|
[31] |
Greathouse J L, Daga M. Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format[C]//Proc of the 2014 Int Conf for High Performance Computing, Networking, Storage and Analysis. Piscataway, NJ: IEEE, 2014: 769−780
|
[32] |
Li Jiajia, Tan Guangming, Chen Mingyu, et al. SMAT: An input adaptive auto-tuner for sparse matrix-vector multiplication[C]//Proc of the 34th ACM SIGPLAN Conf on Programming Language Design and Implementation. New York: ACM, 2013: 117−126
|
[33] |
Zhao Yue, Li Jiajia, Liao Chunhua, et al. Bridging the gap between deep learning and sparse matrix format selection[C]//Proc of the 23rd ACM SIGPLAN Symp on Principles and Practice of Parallel Programming. New York: ACM, 2018: 94−108
|
[34] |
Feng Siying, Sun Jiawen, Pal S, et al. Cosparse: A software and hardware reconfigurable SpMV framework for graph analytics[C]//Proc of the 58th ACM/IEEE Design Automation Conf (DAC). Piscataway, NJ: IEEE, 2021: 949−954
|
[35] |
Tang Waiteng, Zhao Ruizhe, Lu Mian, et al. Optimizing and auto-tuning scale-free sparse matrix-vector multiplication on Intel Xeon Phi[C]//Proc of the 13th Annual IEEE/ACM Int Symp on Code Generation and Optimization. New York: ACM, 2015: 136−145
|
[36] |
Zhao Yue, Zhou Weijie, Shen Xipeng, et al. Overhead-conscious format selection for SpMV-based applications[C]//Proc of the 2018 IEEE Int Parallel and Distributed Processing Symp (IPDPS). Piscataway, NJ: IEEE, 2018: 950−959
|
[37] |
Benatia A, Ji Weixing, Wang Yizhuo, et al. Sparse matrix format selection with multiclass SVM for SpMV on GPU[C]//Proc of the 45th Int Conf on Parallel Processing (ICPP). Piscataway, NJ: IEEE, 2016: 496−505
|
[38] |
谢震,谭光明,孙凝晖. 基于PPR模型的稀疏矩阵向量乘及卷积性能优化研究[J]. 计算机研究与发展,2021,58(3):445−457
Xie Zhen, Tan Guangming, Sun Ninghui. Research on optimal performance of sparse matrix-vector multiplication and convolution using the probability-process-RAM model[J]. Journal of Computer Research and Development, 2021, 58(3): 445−457 (in Chinese)
|
[39] |
Choi J W, Singh A, Vuduc R W. Model-driven autotuning of sparse matrix-vector multiply on GPUs[J]. ACM SIGPLAN Notices, 2010, 45(5): 115−126 doi: 10.1145/1837853.1693471
|
[40] |
Vuduc R, Demmel J W, Yelick K A. OSKI: A library of automatically tuned sparse matrix kernels[J]. Journal of Physics: Conference Series, 2005, 16(1): 521−530
|