Abstract:
Supercomputing has rapidly evolved from traditional CPU clusters to heterogeneous platforms. This shift in hardware platforms poses significant challenges for optimizing computing software and evaluating its performance. Mainstream international performance analysis tools for parallel programs generally have poor compatibility with the processors of domestic heterogeneous supercomputing systems, often require code instrumentation and recompilation, and collect single-node performance data with low accuracy. To address these shortcomings, this article proposes a floating-point performance data collection method for computing software on heterogeneous systems. A prototype floating-point performance collector was developed and verified on a domestic supercomputing system verification platform. The prototype currently collects performance indicator data effectively on both single and multiple nodes and is non-invasive to the original program: it attaches to the monitored program in a plug-in manner, without modifying the program's code, which makes it highly versatile. We carried out comparative experiments with three types of programs, rocHPL, Cannon, and mixbench, and studied performance data collection and monitoring for a ResNet (residual network) program for AI computing. The results show that the proposed method has high accuracy, achieves the expected collection effect in experiments, and provides good reference value for program optimization, verifying its effectiveness.