神威超级计算机运行时故障定位方法

高剑刚; 郑岩; 于康; 彭达佳; 李宏亮; 刘勇; 何王全; 陈德训; 王飞

doi:10.7544/issn1000-1239.202220821

Runtime Fault Location Method for Sunway Supercomputer

Abstract

To meet the needs of scientific research and engineering applications, the performance and scale of high-performance computers keep growing, and the error rates of systems and applications inevitably grow with them. Quickly detecting and locating system-level and application-level errors, and thereby providing high-quality service to users, has become an urgent concern in the design and development of supercomputer systems. Hardware faults and exceptions, as well as software errors, can cause users' large-scale parallel applications to fail, hang, or exit. Quickly and accurately locating the fault site, so that administrators or users can diagnose the fault precisely and efficiently on that basis, is an important foundation for maintaining the reliability of high-performance computing systems. Traditional fault location on high-performance computers relies mainly on hardware exception tracing, system log analysis, and active probing by programs; it lacks a means of locating program hangs that leave no log information and show no obvious fault symptoms, and its scalability is also under pressure. Targeting the architecture of the “New Generation Sunway Supercomputer” and the characteristics of the SW26010-Pro many-core processor, a runtime fault location method is proposed. Its key technologies include fault correlation analysis based on message passing, online comprehensive analysis and diagnosis based on globally aggregated information, and an abnormal-thread filtering method for Shenwei many-core processors. The paper describes how to effectively detect, collect, and process abnormal information from a large number of system resources and parallel processes, providing effective support for efficient fault location in future ultra-large-scale high-performance computing.
