    Xiang Taoran, Ye Xiaochun, Li Wenming, Feng Yujing, Tan Xu, Zhang Hao, Fan Dongrui. Accelerating Fully Connected Layers of Sparse Neural Networks with Fine-Grained Dataflow Architectures[J]. Journal of Computer Research and Development, 2019, 56(6): 1192-1204. DOI: 10.7544/issn1000-1239.2019.20190117

    Accelerating Fully Connected Layers of Sparse Neural Networks with Fine-Grained Dataflow Architectures

    Deep neural networks (DNNs) are state-of-the-art algorithms widely used in applications such as face recognition, intelligent monitoring, image recognition, and text recognition. Because of their high computational complexity, many efficient hardware accelerators have been proposed to exploit a high degree of parallelism for DNNs. However, the fully connected layers in a DNN contain a large number of weight parameters, which places high demands on the accelerator's memory bandwidth. To reduce this bandwidth pressure, several DNN compression algorithms have been proposed. However, accelerators implemented on FPGAs and ASICs usually sacrifice generality for higher performance and lower power consumption, making it difficult for them to accelerate the resulting sparse neural networks. Other accelerators, such as GPUs, are general enough but consume more power. Fine-grained dataflow architectures, which depart from the conventional von Neumann architecture, show natural advantages in processing DNN-like algorithms, offering high computational efficiency and low power consumption while remaining broadly applicable and adaptable. In this paper, we propose a scheme to accelerate the sparse fully connected layers of a DNN on a hardware accelerator based on a fine-grained dataflow architecture. Compared with the original dense fully connected layers, the scheme reduces the peak bandwidth requirement by 2.44× to 6.17×. In addition, the utilization of the computational resources of the fine-grained dataflow accelerator running the sparse fully connected layers far exceeds that of other hardware platforms, being 43.15%, 34.57%, and 44.24% higher than on the CPU, GPU, and mGPU, respectively.