高级检索
    郑祯, 翟季冬, 李焱, 陈文光. 基于CUPTI接口的典型GPU程序负载特征分析[J]. 计算机研究与发展, 2016, 53(6): 1249-1262. DOI: 10.7544/issn1000-1239.2016.20148354
    引用本文: 郑祯, 翟季冬, 李焱, 陈文光. 基于CUPTI接口的典型GPU程序负载特征分析[J]. 计算机研究与发展, 2016, 53(6): 1249-1262. DOI: 10.7544/issn1000-1239.2016.20148354
    Zheng Zhen, Zhai Jidong, Li Yan, Chen Wenguang. Workload Analysis for Typical GPU Programs Using CUPTI Interface[J]. Journal of Computer Research and Development, 2016, 53(6): 1249-1262. DOI: 10.7544/issn1000-1239.2016.20148354
    Citation: Zheng Zhen, Zhai Jidong, Li Yan, Chen Wenguang. Workload Analysis for Typical GPU Programs Using CUPTI Interface[J]. Journal of Computer Research and Development, 2016, 53(6): 1249-1262. DOI: 10.7544/issn1000-1239.2016.20148354

    基于CUPTI接口的典型GPU程序负载特征分析

    Workload Analysis for Typical GPU Programs Using CUPTI Interface

    • 摘要: 基于图形处理器(graphics processing unit, GPU)加速设备的高性能计算机已经成为目前高性能计算领域的一个重要发展趋势.然而,在当前的GPU设备上开发高效的并行程序仍然是一件非常复杂的事情.针对这一问题,1)总结了影响GPU程序性能的5类关键性能指标;2)采用NVIDIA公司提供的CUPTI底层接口,设计并实现了一套GPU程序性能分析工具集,该工具集可以有效地分析GPU程序的性能行为;3)采用该工具集对著名的GPU评测程序集Rodinia中的17个程序和一个真实应用程序进行了负载特征分析.总结出常见性能瓶颈的典型原因,并给出一些开发高效GPU程序的建议.

       

      Abstract: GPU-based high performance computers have become an important trend in the area of high performance computing. However, developing efficient parallel programs on current GPU devices is very complex because of the complex memory hierarchy and thread hierarchy. To address this problem, we summarize five kinds of key metrics that reflect the performance of programs according to the hardware and software architecture. Then we design and implement a performance analysis tool based on underlying CUPTI interfaces provided by NVIDIA, which can collect key metrics automatically without modifying the source code. The tool can analyze the performance behaviors of GPU programs effectively with very little impact on the execution of programs. Finally, we analyze 17 programs in Rodinia benchmark, which is a famous benchmark for GPU programs, and a real application using our tool. By analyzing the value of key metrics, we find the performance bottlenecks of each program and map the bottlenecks back to source code. These analysis results can be used to guide the optimization of CUDA programs and GPU architecture. Result shows that most bottlenecks come from inefficient memory access, and include unreasonable global memory and shared memory access pattern, and low concurrency for these programs. We summarize the common reasons for typical performance bottlenecks and give some high-level suggestions for developing efficient GPU programs.

       

    /

    返回文章
    返回