Abstract:
As a critical component of deep learning, convolution is applied frequently, and its parallel algorithms have long been a popular research topic in high-performance computing. With the rapid adoption of the Chinese homegrown ShenWei-26010 many-core processor in artificial intelligence, there is an urgent demand for high-performance convolutional algorithms on this processor. We propose an efficient convolutional design, a fused Winograd-based convolutional algorithm, tailored to the architectural characteristics of ShenWei-26010 and the computational features of Winograd-based convolution. Unlike the traditional Winograd-based convolutional algorithm, which depends on the official GEMM (general matrix multiplication) library interface, the proposed algorithm uses a customized matrix multiplication implementation. This makes the execution process of the algorithm fully visible, so it can better adapt to the convolutions commonly encountered in practice. The proposed algorithm consists of four parts: input Winograd transformation, filter Winograd transformation, the core operation, and output Winograd inverse transformation. These four parts are fused together rather than executed separately: the core operation obtains the required transformed data in real time, and the computational results are immediately transformed inversely into the final output. This fused execution improves the data locality of the algorithm and significantly reduces memory access overhead. Moreover, we design further optimizations to enhance performance, such as a merged Winograd-transform mode, DMA (direct memory access) double buffering, enhanced usage of on-chip storage, elastic processing of output data tiles, and instruction reordering. Experiments show that the performance of the proposed algorithm is 7.8 times that of the traditional Winograd-based convolutional algorithm on the VGG network model. Furthermore, we extract common convolution cases from multiple typical convolutional neural networks to measure hardware efficiency. The results show that the proposed algorithm significantly outperforms the traditional Winograd-based convolutional algorithm for all convolution cases. The best performance of the proposed algorithm is 116.21% of the theoretical peak performance of the ShenWei-26010 processor, and the average reaches 93.14%.
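For intuition only, the following is a minimal, self-contained sketch of a single F(2x2, 3x3) Winograd tile in C, illustrating the four stages the abstract names (input transform, filter transform, element-wise core operation, inverse output transform). The tile shape, function names, and scalar formulation are hypothetical; the paper's fused, vectorized ShenWei-26010 implementation is not shown here.

```c
#include <stdio.h>

/* C = A (ra x ca) * B (ca x cb), row-major dense matrices. */
static void matmul(const float *A, const float *B, float *C,
                   int ra, int ca, int cb) {
    for (int i = 0; i < ra; ++i)
        for (int j = 0; j < cb; ++j) {
            float s = 0.0f;
            for (int k = 0; k < ca; ++k) s += A[i * ca + k] * B[k * cb + j];
            C[i * cb + j] = s;
        }
}

/* One Winograd F(2x2, 3x3) output tile:
 * d is a 4x4 input tile, g a 3x3 filter, y the resulting 2x2 output tile. */
static void winograd_f2x2_3x3(const float d[16], const float g[9], float y[4]) {
    static const float BT[16] = { 1, 0,-1, 0,   0, 1, 1, 0,
                                  0,-1, 1, 0,   0, 1, 0,-1 };
    static const float B [16] = { 1, 0, 0, 0,   0, 1,-1, 1,
                                 -1, 1, 1, 0,   0, 0, 0,-1 };
    static const float G [12] = { 1,    0,    0,
                                  0.5f, 0.5f, 0.5f,
                                  0.5f,-0.5f, 0.5f,
                                  0,    0,    1 };
    static const float GT[12] = { 1, 0.5f, 0.5f, 0,
                                  0, 0.5f,-0.5f, 0,
                                  0, 0.5f, 0.5f, 1 };
    static const float AT[ 8] = { 1, 1, 1, 0,   0, 1,-1,-1 };
    static const float A [ 8] = { 1, 0,  1, 1,  1,-1,  0,-1 };

    float t[16], U[16], V[16], M[16], m[8];

    matmul(BT, d, t, 4, 4, 4); matmul(t, B,  V, 4, 4, 4);  /* V = B^T d B   (input transform)   */
    matmul(G,  g, t, 4, 3, 3); matmul(t, GT, U, 4, 3, 4);  /* U = G g G^T   (filter transform)  */
    for (int i = 0; i < 16; ++i) M[i] = U[i] * V[i];       /* core operation (element-wise)     */
    matmul(AT, M, m, 2, 4, 4); matmul(m, A,  y, 2, 4, 2);  /* y = A^T M A   (inverse transform) */
}

int main(void) {
    float d[16], y[4];
    float g[9] = { 0, 0, 0,  0, 1, 0,  0, 0, 0 };  /* identity-like 3x3 filter */
    for (int i = 0; i < 16; ++i) d[i] = (float)i;
    winograd_f2x2_3x3(d, g, y);
    printf("%.1f %.1f\n%.1f %.1f\n", y[0], y[1], y[2], y[3]);  /* expect 5 6 / 9 10 */
    return 0;
}
```

In a fused design such as the one the abstract describes, these four stages would be carried out per tile while the transformed data still resides in on-chip storage, rather than materializing each intermediate array in main memory.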