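• China's Outstanding Science and Technology Journal
  • CCF-recommended Class A Chinese journal
  • T1-class high-quality science and technology journal in the computing field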
Zheng Zhen, Zhai Jidong, Li Yan, Chen Wenguang. Workload Analysis for Typical GPU Programs Using CUPTI Interface[J]. Journal of Computer Research and Development, 2016, 53(6): 1249-1262. DOI: 10.7544/issn1000-1239.2016.20148354

Workload Analysis for Typical GPU Programs Using CUPTI Interface

  • Published Date: May 31, 2016
  • GPU-based systems have become an important trend in high performance computing. However, developing efficient parallel programs for current GPUs is difficult because of their complex memory and thread hierarchies. To address this problem, we summarize five kinds of key metrics that reflect program performance, derived from the hardware and software architecture. We then design and implement a performance analysis tool built on the underlying CUPTI interfaces provided by NVIDIA; it collects the key metrics automatically, without modifying the source code, and has very little impact on program execution. Using the tool, we analyze 17 programs from Rodinia, a widely used benchmark suite for GPU programs, together with one real application. From the values of the key metrics we identify the performance bottlenecks of each program and map them back to the source code. These results can guide the optimization of both CUDA programs and GPU architecture. The results show that, for these programs, most bottlenecks come from inefficient memory access (poor global memory and shared memory access patterns) and from low concurrency. We summarize the common causes of typical performance bottlenecks and give high-level suggestions for developing efficient GPU programs.
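The abstract attributes most bottlenecks to inefficient memory access, including poor global memory access patterns. As a rough illustration only (this is not the authors' tool and does not use CUPTI), the sketch below counts how many aligned memory segments one warp touches for a given access stride. The 128-byte segment size, the function names, and the efficiency ratio are all assumptions chosen for illustration; real transaction granularity depends on the GPU architecture and cache configuration.

```python
WARP_SIZE = 32        # threads per warp on NVIDIA GPUs
SEGMENT_BYTES = 128   # assumed transaction granularity (hypothetical, varies by architecture)


def segments_touched(stride_elems, elem_bytes=4, base_addr=0):
    """Count distinct SEGMENT_BYTES-aligned segments accessed by one warp.

    Thread t reads elem_bytes bytes starting at
    base_addr + t * stride_elems * elem_bytes.
    """
    segments = set()
    for t in range(WARP_SIZE):
        first = base_addr + t * stride_elems * elem_bytes
        last = first + elem_bytes - 1
        # An access may straddle a segment boundary, so record every segment it covers.
        segments.update(range(first // SEGMENT_BYTES, last // SEGMENT_BYTES + 1))
    return len(segments)


def load_efficiency(stride_elems, elem_bytes=4):
    """Useful bytes requested divided by bytes moved, under this simplified model."""
    useful = WARP_SIZE * elem_bytes
    moved = segments_touched(stride_elems, elem_bytes) * SEGMENT_BYTES
    return useful / moved


if __name__ == "__main__":
    for stride in (1, 2, 32):
        print(stride, segments_touched(stride), load_efficiency(stride))
```

Under this model, unit-stride access by a warp fits in a single segment, while a stride of 32 elements spreads the same 128 useful bytes across 32 segments. A profiler reports this kind of gap as low memory efficiency.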
