    A Performance Data Collection Method for Computing Software in Heterogeneous Systems

    • Abstract: Supercomputing has moved rapidly from traditional CPU clusters to heterogeneous platforms, and this shift in hardware poses major challenges for tuning computing software and evaluating its performance. Mainstream international performance analysis tools for parallel programs generally have poor compatibility with the processors used in domestic heterogeneous supercomputing systems, often require code instrumentation and recompilation, and collect single-node performance data with limited accuracy. To address these shortcomings, this paper proposes a floating-point performance data collection method for computing software on heterogeneous systems. A prototype of the floating-point performance collector was developed and validated on a domestic supercomputing system verification platform. The prototype currently collects performance metrics effectively on both single and multiple nodes. It is non-intrusive: the code of the monitored program does not need to be modified or instrumented, which makes the method broadly applicable. Finally, comparative experiments were carried out on three types of programs (rocHPL, Cannon, and mixbench), and, for artificial intelligence (AI) computing, performance data collection was also studied on a residual network (ResNet) program. The results show that the proposed method is accurate, that the collection results meet experimental expectations, and that the collected data provide a useful reference for program tuning, which verifies the effectiveness of the method.
