Citation: Wu Zheng, Jin Xu, An Hong. Research and Optimization of the Winograd-Based Convolutional Algorithm on ShenWei-26010 Many-Core Processor[J]. Journal of Computer Research and Development, 2024, 61(4): 955-972. DOI: 10.7544/issn1000-1239.202220787
As a critical component, convolution is widely used in deep learning, and parallel convolution algorithms have long been a popular research topic in high-performance computing. With the rapid adoption of the Chinese homegrown ShenWei-26010 many-core processor in artificial intelligence, there is an urgent demand for high-performance convolution algorithms on this processor. We propose an efficient convolution design, the fused Winograd-based convolutional algorithm, targeting the architectural characteristics of ShenWei-26010 and the computational features of Winograd-based convolution. Unlike the traditional Winograd-based convolutional algorithm, which depends on the official GEMM (general matrix multiplication) library interface, the proposed algorithm uses a customized matrix multiplication implementation. This feature makes the execution process of the algorithm visible, allowing it to adapt better to the convolutions commonly encountered in practice. The proposed algorithm consists of four parts: input Winograd transformation, filter Winograd transformation, the core operation, and output Winograd inverse transformation. Instead of executing each part separately, the four parts are fused together: the core operation obtains the required transformed data in real time, and the computational results are then immediately inverse-transformed into the final output. This fused execution improves data locality and significantly reduces memory access overhead. Moreover, we design further optimization methods to enhance performance, such as a merged Winograd transformation mode, DMA (direct memory access) double buffering, enhanced usage of on-chip storage, elastic processing of output data tiles, and instruction reordering. Experiments show that the performance of the proposed algorithm is 7.8 times that of the traditional Winograd-based convolutional algorithm on the VGG network model. In addition, we extract common convolution layers from multiple typical convolutional neural networks to measure hardware efficiency. The results show that the proposed algorithm significantly outperforms the traditional Winograd-based convolutional algorithm on all convolution cases; its best performance reaches 116.21% of the theoretical peak performance of the ShenWei-26010 processor, and its average performance reaches 93.14%.
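For context, the core identity behind Winograd-based convolution, including the algorithm studied here, is the minimal filtering form Y = A^T[(G g G^T) ⊙ (B^T d B)]A introduced by Lavin and Gray [10]. The NumPy sketch below illustrates the standard F(2×2, 3×3) instance of that identity on a single tile; it is a didactic reference only, not the authors' fused ShenWei-26010 implementation, and the matrix names BT, G, AT follow the usual convention from [10].

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) transformation matrices (Lavin & Gray, CVPR 2016).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)   # input transform B^T
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]], dtype=np.float64)  # filter transform G
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)    # inverse transform A^T

def winograd_f2x2_3x3(d, g):
    """One 2x2 output tile: Y = A^T [ (G g G^T) * (B^T d B) ] A."""
    U = G @ g @ G.T          # 4x4 transformed filter
    V = BT @ d @ BT.T        # 4x4 transformed input tile
    return AT @ (U * V) @ AT.T  # elementwise product, then inverse transform

# Verify against direct convolution (valid correlation) on a random tile.
d = np.random.rand(4, 4)   # 4x4 input tile
g = np.random.rand(3, 3)   # 3x3 filter
ref = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                for i in range(2)])
assert np.allclose(winograd_f2x2_3x3(d, g), ref)
```

In this F(2×2, 3×3) form, each 2×2 output tile costs 16 element-wise multiplications instead of the 36 required by direct convolution, a 2.25× arithmetic reduction; this reduction is also why the effective performance reported for Winograd-based convolution can exceed 100% of the hardware peak. The fusion described in the abstract amounts to computing U and V on the fly and consuming them immediately in the core multiplication, rather than materializing them in off-chip memory between separate stages.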
[1] Khan S, Rahmani H, Shah S A A, et al. A guide to convolutional neural networks for computer vision[J]. Synthesis Lectures on Computer Vision, 2018, 8(1): 1−207 doi: 10.1007/978-3-031-01821-3
[2] Abdel-Hamid O, Mohamed A, Jiang Hui, et al. Convolutional neural networks for speech recognition[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014, 22(10): 1533−1545 doi: 10.1109/TASLP.2014.2339736
[3] Ouyang Zhenchao, Niu Jianwei, Liu Yu, et al. Deep CNN-based real-time traffic light detector for self-driving vehicles[J]. IEEE Transactions on Mobile Computing, 2019, 19(2): 300−313
[4] Litjens G, Kooi T, Bejnordi B E, et al. A survey on deep learning in medical image analysis[J]. Medical Image Analysis, 2017, 42(9): 60−88
[5] Zhang Yi, Shu Bing, Yin Yan, et al. Efficient processing of convolutional neural networks on SW26010 [C] //Proc of the 16th IFIP Int Conf on Network and Parallel Computing. Berlin: Springer, 2019: 316−321
[6] Xu Rui, Ma Sheng, Guo Yang. Performance analysis of different convolution algorithms in GPU environment [C] //Proc of the 13th IEEE Int Conf on Networking, Architecture and Storage. Piscataway, NJ: IEEE, 2018: 45−54
[7] San Juan P, Castelló A, Dolz M F, et al. High performance and portable convolution operators for multicore processors [C] //Proc of the 32nd IEEE Int Symp on Computer Architecture and High Performance Computing. Piscataway, NJ: IEEE, 2020: 91−98
[8] Fang Jiarui, Fu Haohuan, Zhao Wenlai, et al. swDNN: A library for accelerating deep learning applications on Sunway TaihuLight [C] //Proc of the 31st IEEE Int Parallel and Distributed Processing Symp. Piscataway, NJ: IEEE, 2017: 615−624
[9] Nguyen-Thanh N, Le-Duc H, Ta D T, et al. Energy efficient techniques using FFT for deep convolutional neural networks [C] //Proc of the 9th Int Conf on Advanced Technologies for Communications. Piscataway, NJ: IEEE, 2016: 231−236
[10] Lavin A, Gray S. Fast algorithms for convolutional neural networks [C] //Proc of the 29th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2016: 4013−4021
[11] Park H, Kim D, Ahn J, et al. Zero and data reuse-aware fast convolution for deep neural networks on GPU [C] //Proc of the 11th IEEE/ACM/IFIP Int Conf on Hardware/Software Codesign and System Synthesis. Piscataway, NJ: IEEE, 2016: 271−280
[12] Jia Zhen, Zlateski A, Durand F, et al. Optimizing N-dimensional, Winograd-based convolution for manycore CPUs [C] //Proc of the 23rd ACM SIGPLAN Symp on Principles and Practice of Parallel Programming. New York: ACM, 2018: 109−123
[13] Wu Zheng, An Hong, Jin Xu, et al. Research and optimization of fast convolution algorithm Winograd on Intel platform[J]. Journal of Computer Research and Development, 2019, 56(4): 825−835 (in Chinese) doi: 10.7544/issn1000-1239.2019.20170932
[14] Mazaheri A, Beringer T, Moskewicz M, et al. Accelerating Winograd convolutions using symbolic computation and meta-programming [C] //Proc of the 15th European Conf on Computer Systems. New York: ACM, 2020: 616−629
[15] Jia Liancheng, Liang Yun, Li Xiuhong, et al. Enabling efficient fast convolution algorithms on GPUs via MegaKernels[J]. IEEE Transactions on Computers, 2020, 69(7): 986−997
[16] Castro R L, Andrade D, Fraguela B B. OpenCNN: A Winograd minimal filtering algorithm implementation in CUDA[J]. Mathematics, 2021, 9(17): 1−19
[17] Wang Qinglin, Li Dongsheng, Mei Songzhu, et al. Optimizing Winograd-based fast convolution algorithm on Phytium multi-core CPUs[J]. Journal of Computer Research and Development, 2020, 57(6): 1140−1151 (in Chinese) doi: 10.7544/issn1000-1239.2020.20200107
[18] Meng Jintao, Zhuang Chen, Chen Peng, et al. Automatic generation of high-performance convolution kernels on ARM CPUs for deep learning[J]. IEEE Transactions on Parallel and Distributed Systems, 2022, 33(11): 2885−2899 doi: 10.1109/TPDS.2022.3146257
[19] Lu Liqiang, Liang Yun. SpWA: An efficient sparse Winograd convolutional neural networks accelerator on FPGAs [C/OL] //Proc of the 55th Annual Design Automation Conf. Piscataway, NJ: IEEE, 2018 [2022-08-21]. https://dl.acm.org/doi/abs/10.1145/3195970.3196120
[20] Zhao Wenlai, Fu Haohuan, Fang Jiarui, et al. Optimizing convolutional neural networks on the Sunway TaihuLight supercomputer[J]. ACM Transactions on Architecture and Code Optimization, 2018, 15(1): 310−335
[21] Fu Haohuan, Liao Junfeng, Yang Jinzhe, et al. The Sunway TaihuLight supercomputer: System and applications[J]. Science China Information Sciences, 2016, 59(7): 1−16
[22] Lin J, Xu Zhigeng, Cai Linjin, et al. Evaluating the SW26010 many-core processor with a micro-benchmark suite for performance optimizations[J]. Parallel Computing, 2018, 77(3): 128−143
[23] Xu Zhigeng, Lin J, Matsuoka S. Benchmarking SW26010 many-core processor [C] //Proc of the 31st IEEE Int Parallel and Distributed Processing Symp Workshops. Piscataway, NJ: IEEE, 2017: 743−752
[24] Intel. Intel Xeon Phi processor [EB/OL]. [2022-08-21]. https://software.intel.com/en-us/xeon-phi/x200-processor
[25] NVIDIA. NVIDIA Tesla V100 [EB/OL]. [2022-08-21]. https://www.nvidia.com/en-gb/data-center/tesla-v100/
[26] Wu Zheng, Li Mingfan, Chi Mengxian, et al. Runtime adaptive matrix multiplication for the SW26010 many-core processor[J]. IEEE Access, 2020, 8: 156915−156928 doi: 10.1109/ACCESS.2020.3019302
[27] Li Xiuhong, Liang Yun, Yan Shengen, et al. A coordinated tiling and batching framework for efficient GEMM on GPUs [C] //Proc of the 24th Symp on Principles and Practice of Parallel Programming. New York: ACM, 2019: 229−241
[28] Li Fang, Liu Xin, Liu Yong, et al. SW_Qsim: A minimize-memory quantum simulator with high-performance on a new Sunway supercomputer [C/OL] //Proc of the 33rd Int Conf for High Performance Computing, Networking, Storage and Analysis. New York: ACM, 2021 [2022-08-21]. https://dl.acm.org/doi/10.1145/3458817.3476161
[29] Chen Xin, Gao Yingxiang, Shang Honghui, et al. Increasing the efficiency of massively parallel sparse matrix-matrix multiplication in first-principles calculation on the new-generation Sunway supercomputer[J]. IEEE Transactions on Parallel and Distributed Systems, 2022, 33(12): 4752−4766 doi: 10.1109/TPDS.2022.3202518
[30] Jiang Lijuan, Yang Chao, Ao Yulong, et al. Towards highly efficient DGEMM on the emerging SW26010 many-core processor [C] //Proc of the 46th Int Conf on Parallel Processing. Piscataway, NJ: IEEE, 2017: 422−431