ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2020, Vol. 57 ›› Issue (6): 1140-1151. doi: 10.7544/issn1000-1239.2020.20200107

Special Issue: 2020 Special Topic on Frontier Technologies in Computer Architecture

Optimizing Winograd-Based Fast Convolution Algorithm on Phytium Multi-Core CPUs

Wang Qinglin, Li Dongsheng, Mei Songzhu, Lai Zhiquan, Dou Yong   

  1. (Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073) (College of Computer, National University of Defense Technology, Changsha 410073)
  • Online:2020-06-01
  • Supported by: 
    This work was supported by the National Science and Technology Major Projects of Hegaoji (2018ZX01028101).

Abstract: Convolutional neural networks (CNNs) are widely used in artificial intelligence fields such as computer vision and natural language processing. Winograd-based fast convolution algorithms effectively reduce the computational complexity of the convolution operations in CNNs and have therefore attracted considerable attention. With Phytium multi-core CPUs, independently developed by the National University of Defense Technology, being applied in artificial intelligence fields, there is strong demand for high-performance convolution primitives on these CPUs. Based on a study of the architectural characteristics of Phytium multi-core CPUs and the computing characteristics of Winograd-based fast convolution algorithms, this paper proposes a new high-performance parallel Winograd-based fast convolution algorithm. The new parallel algorithm does not rely on general matrix multiplication routines and consists of four stages: kernel transformation, input feature map transformation, element-wise multiplication, and output feature map inverse transformation. Data movement across all four stages is collaboratively optimized to improve the memory access performance of the algorithm. Custom data layouts, multi-level parallel data transformation algorithms, and a multi-level parallel matrix multiplication algorithm are also proposed to support this optimization efficiently. The algorithm is tested on two Phytium multi-core CPUs. Compared with the Winograd-based fast convolution implementations in the ARM Compute Library (ACL) and NNPACK, the algorithm achieves speedups of 1.05~16.11 times and 1.66~16.90 times, respectively. Applying the algorithm in the open-source framework MXNet improves the forward-propagation performance of the VGG16 network by 3.01~6.79 times.
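
The four stages named in the abstract can be illustrated with the textbook one-dimensional Winograd minimal filtering case F(2,3), which produces two outputs of a 3-tap filter with 4 multiplications instead of 6. The sketch below is only an illustration under that assumption: it is not the paper's parallel algorithm, it contains none of the data layout or multi-core optimizations described above, and the function names `winograd_f23` and `direct_f23` are hypothetical.

```python
# Standard F(2,3) transform matrices (textbook values, not taken from the paper).
G = [[1.0, 0.0, 0.0],
     [0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5],
     [0.0, 0.0, 1.0]]          # kernel transform
BT = [[1, 0, -1, 0],
      [0, 1, 1, 0],
      [0, -1, 1, 0],
      [0, 1, 0, -1]]           # input transform
AT = [[1, 1, 1, 0],
      [0, 1, -1, -1]]          # output inverse transform


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def winograd_f23(d, g):
    """Two outputs of a 3-tap filter g over a 4-element input tile d."""
    u = matvec(G, g)                      # stage 1: kernel transformation
    v = matvec(BT, d)                     # stage 2: input transformation
    m = [ui * vi for ui, vi in zip(u, v)]  # stage 3: element-wise multiplication (4 products)
    return matvec(AT, m)                  # stage 4: output inverse transformation


def direct_f23(d, g):
    """Reference result computed directly (6 products)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


if __name__ == "__main__":
    tile, kernel = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0]
    print(winograd_f23(tile, kernel), direct_f23(tile, kernel))
```

In a CNN layer the same idea is applied in two dimensions to tiles of the feature maps, and the element-wise products are accumulated across input channels, which is where the matrix multiplication mentioned in the abstract comes from.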

Key words: multi-core CPUs, deep learning, convolutional neural networks, Winograd algorithms, parallel algorithms

CLC Number: