Wang Qinglin, Li Dongsheng, Mei Songzhu, Lai Zhiquan, Dou Yong. Optimizing Winograd-Based Fast Convolution Algorithm on Phytium Multi-Core CPUs[J]. Journal of Computer Research and Development, 2020, 57(6): 1140-1151. DOI: 10.7544/issn1000-1239.2020.20200107

Optimizing Winograd-Based Fast Convolution Algorithm on Phytium Multi-Core CPUs

Funds: This work was supported by the National Science and Technology Major Project on Core Electronic Devices, High-end Generic Chips and Basic Software (the "Hegaoji" program) (2018ZX01028101).
  • Published Date: May 31, 2020
  • Convolutional neural networks (CNNs) are widely used in artificial intelligence fields such as computer vision and natural language processing. Winograd-based fast convolution algorithms effectively reduce the computational complexity of the convolution operations in CNNs and have therefore attracted great attention. As Phytium multi-core CPUs, independently developed by the National University of Defense Technology, are adopted in artificial intelligence applications, there is strong demand for high-performance convolution primitives on these CPUs. This paper proposes a new high-performance parallel Winograd-based fast convolution algorithm, based on a study of the architectural characteristics of Phytium multi-core CPUs and the computing characteristics of Winograd-based fast convolution algorithms. The new parallel algorithm does not rely on general matrix multiplication routines and consists of four stages: kernel transformation, input feature map transformation, element-wise multiplication, and inverse transformation of the output feature maps. Data movement in all four stages is collaboratively optimized to improve the memory access performance of the algorithm. Custom data layouts, multi-level parallel data transformation algorithms, and a multi-level parallel matrix multiplication algorithm are also proposed to support this optimization efficiently. The algorithm is tested on two Phytium multi-core CPUs. Compared with the Winograd-based fast convolution implementations in the ARM Compute Library (ACL) and NNPACK, the algorithm achieves speedups of 1.05 to 16.11 times and 1.66 to 16.90 times, respectively. Applying the algorithm in the open-source framework MXNet improves the forward-propagation performance of the VGG16 network by 3.01 to 6.79 times.
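
The four stages named in the abstract can be illustrated with a minimal single-tile sketch. It assumes the widely used F(2×2, 3×3) variant with the standard transformation matrices; the abstract does not state which tile size the paper uses. The function names (`winograd_f2x2_3x3`, `direct_conv`) are hypothetical, and the sketch covers one channel and one tile only, with none of the paper's data layouts, parallelization, or Phytium-specific optimizations.

```python
# Minimal single-tile sketch of Winograd F(2x2, 3x3) in pure Python.
# Assumption: standard F(2x2, 3x3) matrices; not the paper's implementation.

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

# Standard transformation matrices for F(2x2, 3x3).
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]

def winograd_f2x2_3x3(d, g):
    """d: 4x4 input tile, g: 3x3 kernel; returns the 2x2 output tile."""
    # Stage 1: kernel transformation, U = G g G^T (4x4).
    U = matmul(matmul(G, g), transpose(G))
    # Stage 2: input feature map transformation, V = B^T d B (4x4).
    V = matmul(matmul(BT, d), transpose(BT))
    # Stage 3: element-wise multiplication (16 multiplications).
    M = [[u * v for u, v in zip(u_row, v_row)] for u_row, v_row in zip(U, V)]
    # Stage 4: output inverse transformation, Y = A^T M A (2x2).
    return matmul(matmul(AT, M), transpose(AT))

def direct_conv(d, g):
    """Reference: direct 3x3 sliding-window convolution (36 multiplications)."""
    return [[sum(d[i + r][j + s] * g[r][s] for r in range(3) for s in range(3))
             for j in range(2)] for i in range(2)]

if __name__ == "__main__":
    d = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    g = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
    print(winograd_f2x2_3x3(d, g))
    print(direct_conv(d, g))
```

For one 2×2 output tile, the element-wise stage uses 16 multiplications where the direct method uses 36. With many input and output channels, the kernel transformation can be computed once and reused across all tiles.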