ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2020, Vol. 57 ›› Issue (6): 1140-1151. doi: 10.7544/issn1000-1239.2020.20200107

Special topic: 2020 Special Topic on Frontier Technologies in Computer Architecture

• System Architecture •

Optimizing Winograd-Based Fast Convolution Algorithm on Phytium Multi-Core CPUs

Wang Qinglin, Li Dongsheng, Mei Songzhu, Lai Zhiquan, Dou Yong

  1. (Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073) (College of Computer, National University of Defense Technology, Changsha 410073) (wangqinglin@nudt.edu.cn)
  • Published online: 2020-06-01
  • Supported by: 
    This work was supported by the National Science and Technology Major Project "Hegaoji" (2018ZX01028101).

Abstract: With the rapid development of deep learning, convolutional neural networks (CNNs) have been widely used in artificial intelligence fields such as computer vision and natural language processing. Winograd-based fast convolution algorithms have attracted considerable attention because they effectively reduce the computational complexity of the convolution operations in CNNs. As Phytium multi-core CPUs, independently developed by the National University of Defense Technology, are increasingly deployed in artificial intelligence applications, there is strong demand for high-performance convolution primitives on these processors. Based on the architectural characteristics of Phytium multi-core CPUs and the computational characteristics of Winograd-based fast convolution, this paper proposes a high-performance parallel Winograd-based fast convolution algorithm. The algorithm does not rely on general matrix multiplication library routines and consists of four stages: kernel transformation, input feature map transformation, element-wise multiplication, and inverse transformation of the output feature maps. The data operations of the four stages are designed jointly, together with matching data layouts, multi-level parallel data transformation algorithms, and a multi-level parallel matrix multiplication algorithm, to improve both memory access performance and overall performance. Tests on two Phytium multi-core CPUs show that, compared with the Winograd-based fast convolution implementations in the open-source libraries Arm Compute Library (ACL) and NNPACK, the algorithm achieves speedups of 1.05 to 16.11 times and 1.66 to 16.90 times, respectively. Integrated into the open-source framework MXNet, the algorithm speeds up the forward propagation of the VGG16 network by 3.01 to 6.79 times.

Key words: multi-core CPUs, deep learning, convolutional neural networks, Winograd algorithms, parallel algorithms
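
The four stages named in the abstract can be illustrated on a single tile. The sketch below is not the paper's implementation: it assumes the common F(2×2, 3×3) variant (a 4×4 input tile and a 3×3 kernel producing a 2×2 output tile) with the standard minimal-filtering matrices G, B^T, and A^T, none of which are stated in the abstract. The function names `winograd_tile` and `direct_tile` are hypothetical, and the code ignores the data layouts, stage fusion, and multi-level parallelism that the paper proposes.

```python
# Winograd F(2x2, 3x3) on one tile, in pure Python.
# Assumption: tile size and transform matrices are the standard textbook
# choice, used here for illustration only.

def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

G = [[1.0, 0.0, 0.0],
     [0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5],
     [0.0, 0.0, 1.0]]
BT = [[1, 0, -1, 0],
      [0, 1, 1, 0],
      [0, -1, 1, 0],
      [0, 1, 0, -1]]
AT = [[1, 1, 1, 0],
      [0, 1, -1, -1]]

def winograd_tile(d, g):
    """d: 4x4 input tile, g: 3x3 kernel; returns the 2x2 output tile."""
    u = matmul(matmul(G, g), transpose(G))      # stage 1: kernel transformation
    v = matmul(matmul(BT, d), transpose(BT))    # stage 2: input transformation
    m = [[u[i][j] * v[i][j] for j in range(4)]  # stage 3: element-wise multiply
         for i in range(4)]                     #          (16 multiplications)
    return matmul(matmul(AT, m), transpose(AT)) # stage 4: output inverse transform

def direct_tile(d, g):
    """Reference sliding-window result (36 multiplications) for comparison."""
    return [[sum(d[i + p][j + q] * g[p][q] for p in range(3) for q in range(3))
             for j in range(2)] for i in range(2)]

# Example tile: values 1..16 in row-major order, with a Sobel-like kernel.
d = [[4 * i + j + 1 for j in range(4)] for i in range(4)]
g = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
y = winograd_tile(d, g)
```

Here `winograd_tile(d, g)` and `direct_tile(d, g)` should agree up to floating-point rounding. Stage 3 uses 16 multiplications per tile where the direct method uses 36, which is the source of the reduced arithmetic cost. In a full convolution the kernel transformation is computed once per kernel and reused across all tiles.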

CLC Number: