Optimizing Winograd-Based Fast Convolution Algorithm on Phytium Multi-Core CPUs

Wang Qinglin (王庆林); Li Dongsheng (李东升); Mei Songzhu (梅松竹); Lai Zhiquan (赖志权); Dou Yong (窦勇)

doi:10.7544/issn1000-1239.2020.20200107

Abstract: Convolutional neural networks (CNNs) are widely used in artificial intelligence fields such as computer vision and natural language processing. Winograd-based fast convolution algorithms have attracted considerable attention because they effectively reduce the computational complexity of the convolution operations in CNNs. As Phytium multi-core CPUs, independently developed by the National University of Defense Technology, are increasingly deployed in artificial intelligence applications, there is strong demand for high-performance convolution implementations on these processors. Based on the architectural characteristics of Phytium multi-core CPUs and the computational characteristics of Winograd-based fast convolution, this paper proposes a high-performance parallel Winograd-based fast convolution algorithm. The algorithm does not rely on general matrix multiplication (GEMM) library routines and consists of four stages: kernel transformation, input feature map transformation, element-wise multiplication, and inverse transformation of the output feature maps. The data operations of the four stages are designed jointly, together with matching data layouts, multi-level parallel data transformation algorithms, and a multi-level parallel matrix multiplication algorithm, to improve both memory access performance and overall performance. Tests on two Phytium multi-core CPUs show that, compared with the Winograd-based fast convolution implementations in the open-source libraries Arm Compute Library (ACL) and NNPACK, the algorithm achieves speedups of 1.05 to 16.11 times and 1.66 to 16.90 times, respectively. When integrated into the open-source framework MXNet, the algorithm speeds up the forward propagation of the VGG16 network by 3.01 to 6.79 times.
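The four stages named in the abstract can be illustrated with a minimal single-tile sketch. It assumes the widely used F(2x2, 3x3) transform matrices, a single channel, and the CNN convention of convolution as correlation; the abstract does not state which tile sizes the paper uses, and the helper names (`matmul`, `winograd_tile`, `direct_tile`) are hypothetical. It does not reproduce the paper's data layouts, stage fusion, or parallelization.

```python
# Standard F(2x2, 3x3) Winograd transform matrices (an assumption: the
# abstract does not say which tile size the paper's algorithm uses).
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]  # B^T, input transform
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]     # kernel transform
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]                               # A^T, output transform


def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def winograd_tile(d, g):
    """Compute a 2x2 output tile from a 4x4 input tile d and a 3x3 kernel g."""
    U = matmul(matmul(G, g), transpose(G))        # stage 1: kernel transformation, G g G^T
    V = matmul(matmul(BT, d), transpose(BT))      # stage 2: input transformation, B^T d B
    M = [[U[i][j] * V[i][j] for j in range(4)]    # stage 3: element-wise multiplication
         for i in range(4)]
    return matmul(matmul(AT, M), transpose(AT))   # stage 4: inverse transformation, A^T M A


def direct_tile(d, g):
    """Reference: direct 3x3 sliding-window correlation producing the same 2x2 tile."""
    return [[sum(d[i + r][j + c] * g[r][c] for r in range(3) for c in range(3))
             for j in range(2)] for i in range(2)]
```

For one tile, stage 3 uses 16 multiplications where the direct computation uses 36, which is the source of the reduced arithmetic complexity. In a multi-channel layer, stage 3 accumulates over input channels and becomes a set of small matrix multiplications, which is presumably where the matrix multiplication algorithm mentioned in the abstract applies.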
