
    Research and Optimization of the Winograd-Based Convolutional Algorithm on ShenWei-26010 Many-Core Processor

      Abstract: As a critical component of deep learning, convolution is applied frequently, and its parallel algorithms have long been a popular research topic in high-performance computing. With the rapid adoption of the Chinese homegrown ShenWei-26010 many-core processor in artificial intelligence, there is an urgent demand for high-performance parallel convolutional algorithms on this processor. Targeting the architectural characteristics of the ShenWei-26010 processor and the computational features of Winograd-based convolution, we propose an efficient parallel convolutional design: the fused Winograd-based convolutional algorithm. Unlike the traditional Winograd-based convolutional algorithm, which depends on the official GEMM (general matrix multiplication) library interface, the proposed algorithm uses a custom matrix multiplication implementation. This makes the execution process of the algorithm visible and allows it to adapt better to the convolutions commonly encountered in practice. The algorithm consists of four parts: the Winograd transformation of the input, the Winograd transformation of the filters, the core operation, and the Winograd inverse transformation of the output. These four parts are fused together rather than executed separately: the core operation receives the transformed data it requires in real time, and the computational results are immediately inverse-transformed into the final output. This fused execution improves data locality and significantly reduces the overall memory access overhead. Moreover, we design further optimizations for the algorithm, including a merged Winograd transformation mode, DMA (direct memory access) double buffering, intensive use of on-chip storage, elastic processing of output data tiles, and instruction reordering. Experiments show that on the overall convolution workload of the VGG network model, the proposed algorithm achieves 7.8 times the performance of the traditional Winograd-based convolutional algorithm. Furthermore, on convolutions extracted from typical convolutional neural network models, the fused Winograd-based convolutional algorithm clearly outperforms the traditional one in every tested convolution case, reaching at best 116.21% of the theoretical peak performance of the ShenWei-26010 processor and 93.14% of peak on average.
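The four stages named in the abstract can be illustrated on the simplest 1-D Winograd case, F(2,3), where a 4-element input tile and a 3-tap filter produce 2 outputs with 4 multiplications instead of 6. This is a minimal sketch for intuition only, not the paper's fused, parallel 2-D implementation on ShenWei-26010; all function names here are hypothetical:

```python
# Illustrative 1-D Winograd F(2,3) sketch; the paper's algorithm fuses
# these stages and batches stage 3 into custom GEMMs on ShenWei-26010.

def winograd_f23(d, g):
    """Two outputs of a valid 1-D convolution of a 4-element input
    tile d with a 3-tap filter g, using 4 multiplications."""
    # 1) Input Winograd transform: B^T d
    td = [d[0] - d[2], d[1] + d[2], -d[1] + d[2], d[1] - d[3]]
    # 2) Filter Winograd transform: G g
    tg = [g[0], (g[0] + g[1] + g[2]) / 2, (g[0] - g[1] + g[2]) / 2, g[2]]
    # 3) Core operation: element-wise products (batched across tiles
    #    and channels into matrix multiplications in the real algorithm)
    m = [td[i] * tg[i] for i in range(4)]
    # 4) Output Winograd inverse transform: A^T m
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]

def direct_conv(d, g):
    """Reference: direct valid convolution (cross-correlation)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

if __name__ == "__main__":
    d, g = [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0]
    print(winograd_f23(d, g))  # matches direct_conv(d, g): [6.0, 9.0]
```

The arithmetic saving (4 multiplications for 2 outputs instead of 6) is also why the abstract can report more than 100% of theoretical peak: performance is conventionally measured against the operation count of direct convolution.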

       

