    Citation: Li Maowen, Qu Guoyuan, Wei Dazhou, Jia Haipeng. Performance Optimization of Neural Network Convolution Based on GPU Platform[J]. Journal of Computer Research and Development, 2022, 59(6): 1181-1191. DOI: 10.7544/issn1000-1239.20200985

    Performance Optimization of Neural Network Convolution Based on GPU Platform

    Abstract: Image detection and recognition tasks are being applied in more and more production and everyday scenarios, and methods based on convolutional neural networks are widely used because of their high accuracy. However, convolutional neural networks have many weight parameters and high computational requirements, so these applications are constrained on edge computing devices, whose computing power is limited and whose models vary widely. Running high-performance code across platforms and optimizing convolutional neural networks on GPUs are therefore increasingly important. To address the convolution sizes found in convolutional neural networks and the shortcomings of other general matrix multiplication (GEMM) methods, a GEMM optimization method tailored to convolutional-neural-network sizes is proposed, based on tiling size, branch execution, and the ratio of memory access to computation; it is applied to the Winograd algorithm and combined with operator fusion to further optimize convolution. In addition, a traversal-based auto-tuner selects the best-performing convolution operator, and offline compilation, a memory pool, 16-bit quantization, and network-size pruning are combined to improve the performance of convolutional neural networks. Finally, experiments on the AMD V1605B platform verify the effectiveness of the algorithm. Comparisons with other GEMM algorithms and with deep learning networks show that the proposed method achieves better speedups than the GEMM and Winograd algorithms and effectively accelerates convolutional neural networks.
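    The traversal-based auto-tuning step mentioned in the abstract can be pictured with a small host-side sketch: for a given convolution shape, every available operator implementation is run and timed, and the fastest one is kept. This is only a minimal illustration under assumptions of my own; the names ConvShape, ConvOperator, and select_best_operator, the candidate list, and the timing loop are hypothetical and are not taken from the paper, whose actual operators (tiled GEMM, Winograd) and offline-compilation scheme are not reproduced here.

        // Minimal sketch of traversal-based operator selection (all names hypothetical).
        #include <chrono>
        #include <functional>
        #include <iostream>
        #include <limits>
        #include <string>
        #include <vector>

        // Hypothetical description of one convolution problem size.
        struct ConvShape {
            int n, c, h, w, k, r, s;  // batch, in-channels, height, width, filters, kernel height/width
        };

        // A candidate operator: a name plus a callable that runs the convolution once.
        struct ConvOperator {
            std::string name;
            std::function<void(const ConvShape&)> run;  // e.g. tiled GEMM, im2col+GEMM, Winograd
        };

        // Traversal-based selection: time every candidate on the given shape and keep the fastest.
        std::string select_best_operator(const ConvShape& shape,
                                         const std::vector<ConvOperator>& candidates,
                                         int repeats = 5) {
            std::string best;
            double best_ms = std::numeric_limits<double>::max();
            for (const auto& op : candidates) {
                op.run(shape);  // warm-up run, not timed
                auto t0 = std::chrono::steady_clock::now();
                for (int i = 0; i < repeats; ++i) op.run(shape);
                auto t1 = std::chrono::steady_clock::now();
                double ms = std::chrono::duration<double, std::milli>(t1 - t0).count() / repeats;
                if (ms < best_ms) { best_ms = ms; best = op.name; }
            }
            return best;
        }

        int main() {
            ConvShape shape{1, 64, 56, 56, 64, 3, 3};
            std::vector<ConvOperator> candidates = {
                {"tiled_gemm", [](const ConvShape&) { /* placeholder: call a tiled-GEMM convolution */ }},
                {"winograd",   [](const ConvShape&) { /* placeholder: call a Winograd convolution */ }},
            };
            std::cout << "fastest operator: " << select_best_operator(shape, candidates) << "\n";
        }

    In this sketch the selection would be run once per layer shape and the result cached, which matches the idea of tuning ahead of time and reusing the chosen operator at inference; the paper's exact caching and offline-compilation mechanism is not specified here.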

       
