ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2016, Vol. 53 ›› Issue (6): 1249-1262.doi: 10.7544/issn1000-1239.2016.20148354

Previous Articles     Next Articles

Workload Analysis for Typical GPU Programs Using CUPTI Interface

Zheng Zhen, Zhai Jidong, Li Yan, Chen Wenguang   

  1. (Department of Computer Science and Technology, Tsinghua University, Beijing 100084)
  • Online:2016-06-01

Abstract: GPU-based high performance computers have become an important trend in the area of high performance computing. However, developing efficient parallel programs on current GPU devices is very complex because of the complex memory hierarchy and thread hierarchy. To address this problem, we summarize five kinds of key metrics that reflect the performance of programs according to the hardware and software architecture. Then we design and implement a performance analysis tool based on underlying CUPTI interfaces provided by NVIDIA, which can collect key metrics automatically without modifying the source code. The tool can analyze the performance behaviors of GPU programs effectively with very little impact on the execution of programs. Finally, we analyze 17 programs in Rodinia benchmark, which is a famous benchmark for GPU programs, and a real application using our tool. By analyzing the value of key metrics, we find the performance bottlenecks of each program and map the bottlenecks back to source code. These analysis results can be used to guide the optimization of CUDA programs and GPU architecture. Result shows that most bottlenecks come from inefficient memory access, and include unreasonable global memory and shared memory access pattern, and low concurrency for these programs. We summarize the common reasons for typical performance bottlenecks and give some high-level suggestions for developing efficient GPU programs.

Key words: graphics processing unit (GPU), workload analysis, Rodinia, performance counter, performance metric

CLC Number: