Abstract:
As a critical component, convolution is widely used in deep learning, and parallel convolution algorithms have long been a popular research topic in high-performance computing. With the rapid adoption of the Chinese homegrown ShenWei-26010 many-core processor in artificial intelligence, there is an urgent demand for high-performance convolution algorithms on this processor. This paper proposes an efficient convolution design, a fused Winograd-based convolution algorithm, tailored to the architectural characteristics of the ShenWei-26010 and the computational features of Winograd-based convolution. Unlike the traditional Winograd-based convolution algorithm, which depends on the official GEMM (general matrix multiplication) library interface, the proposed algorithm uses a customized matrix multiplication implementation. This makes the execution process of the algorithm fully visible, allowing it to adapt better to the convolutions commonly encountered in practice. The proposed algorithm consists of four parts: input Winograd transformation, filter Winograd transformation, the core operation, and output Winograd inverse transformation. Rather than executing each part separately, the four parts are fused: the core operation obtains the required transformed data in real time, and the computational results are immediately inverse-transformed into the final output. This fused execution improves the data locality of the algorithm and significantly reduces memory access overhead. Moreover, this paper designs further optimizations to enhance the performance of the proposed algorithm, including a merged Winograd-transform mode, DMA (Direct Memory Access) double buffering, enhanced usage of on-chip storage, elastic processing of output data tiles, and instruction reordering. Experiments show that the proposed algorithm achieves 7.8 times the performance of the traditional Winograd-based convolution algorithm on the VGG network model.
Moreover, this paper extracts common convolutions from multiple typical convolutional neural networks to measure hardware efficiency. The results show that the proposed algorithm significantly outperforms the traditional Winograd-based convolution algorithm in all convolution cases. The best performance of the proposed algorithm reaches 116.21% of the theoretical peak performance of the ShenWei-26010 processor, and the average reaches 93.14%.
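For background on why Winograd-based convolution can exceed the nominal peak of a GEMM-style count, the algorithm trades multiplications for cheap additions inside each tile. The following is a minimal, illustrative sketch of the one-dimensional case F(2,3) with its standard constant transform matrices; it is not the paper's ShenWei-26010 implementation, and the function names are ours.

```python
import numpy as np

# Winograd F(2,3): computes 2 outputs of a 1-D convolution with a 3-tap
# filter using 4 elementwise multiplications instead of 6.
# B^T, G, A^T are the standard constant transform matrices for F(2,3).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)   # input transform
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=float)   # filter transform
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)    # inverse (output) transform

def winograd_f23(d, g):
    """Y = A^T [(G g) * (B^T d)] for a 4-element input tile d and 3-tap filter g."""
    return AT @ ((G @ g) * (BT @ d))

def direct_conv(d, g):
    """Reference: valid 1-D correlation producing 2 outputs."""
    return np.array([np.dot(d[i:i + 3], g) for i in range(2)])

d = np.array([1.0, 2.0, 3.0, 4.0])  # one input tile
g = np.array([1.0, 1.0, 1.0])       # one filter
print(winograd_f23(d, g))           # matches direct_conv(d, g): [6. 9.]
```

In the 2-D case used for 3x3 convolutions, the same transforms are applied along both dimensions; the fusion described in the abstract keeps the transformed tiles on-chip between these stages instead of writing them back to memory.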