    Zhang Jun, He Yanxiang, Shen Fanfan, Jiang Nan, Li Qing’an. Two-Stage Synchronization Based Thread Block Compaction Scheduling Method of GPGPU[J]. Journal of Computer Research and Development, 2016, 53(6): 1173-1185. DOI: 10.7544/issn1000-1239.2016.20150114

    Two-Stage Synchronization Based Thread Block Compaction Scheduling Method of GPGPU

    • General purpose graphics processing units (GPGPUs) are increasingly used in general purpose computing, particularly for high performance and high throughput workloads. Their computing power comes from the single instruction multiple data (SIMD) execution model they adopt, and running computing tasks efficiently on massive numbers of highly parallel threads has become the mainstream approach. However, this parallel computing capability degrades on divergent branch control flow, because the different branch paths are processed sequentially. In this paper, we analyze why previously proposed thread block compaction scheduling methods handle divergent branches inefficiently, and on that basis propose TSTBC (two-stage synchronization based thread block compaction scheduling). The method uses adequacy decision logic to assess whether compacting and reconstructing a thread block would be effective, and thereby reduces the number of ineffective thread block compactions. Simulation results show that, relative to other methods of the same type, TSTBC improves the effectiveness of thread block compaction and reconstruction to some extent, and effectively reduces both the disruption of data locality within thread groups and the miss rate of the on-chip level-one data cache. The performance of the whole system is increased by 1927% over the baseline architecture.
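The cost of branch divergence, and the reason compaction only sometimes helps, can be illustrated with a small counting model. This is a minimal sketch under stated assumptions, not the TSTBC method itself: the abstract does not describe how the adequacy decision logic works. The function names, the 4-lane warp width, and the rule that a thread keeps its lane when regrouped are all assumptions made for the illustration.

```python
from collections import defaultdict

# Hypothetical toy model: a thread block is a list of warps, and each warp is a
# list holding the branch path ("T" or "F") taken by the thread in each lane.
# Warps here have 4 lanes for readability (an assumption; real hardware is wider).


def issue_slots_without_compaction(warps):
    """A divergent warp runs each distinct branch path one after another."""
    return sum(len(set(paths)) for paths in warps)


def issue_slots_with_compaction(warps):
    """Regroup threads of the whole block by branch path.

    Assumption: a thread stays in its original lane, so the number of
    compacted warps needed for one path is the largest number of threads
    on that path that share a lane.
    """
    per_path_lane = defaultdict(lambda: defaultdict(int))
    for paths in warps:
        for lane, path in enumerate(paths):
            per_path_lane[path][lane] += 1
    return sum(max(lanes.values()) for lanes in per_path_lane.values())


# Complementary divergence: compaction halves the number of issue slots.
helpful = [["T", "F", "T", "F"],
           ["F", "T", "F", "T"]]

# Identical divergence pattern in every warp: compaction gains nothing,
# because threads on the same path collide in the same lanes.
unhelpful = [["T", "F", "T", "F"],
             ["T", "F", "T", "F"]]

helpful_counts = (issue_slots_without_compaction(helpful),
                  issue_slots_with_compaction(helpful))
unhelpful_counts = (issue_slots_without_compaction(unhelpful),
                    issue_slots_with_compaction(unhelpful))
print(helpful_counts, unhelpful_counts)
```

In this model the second block is a case where compacting costs synchronization effort but saves no issue slots, which is the kind of ineffective compaction a decision step could skip. Whether TSTBC uses a comparable criterion is not stated in the abstract.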
